
Max hotplugged memory should not exceed actual max host memory #4516

Closed
skaegi opened this issue Jun 22, 2022 · 8 comments · Fixed by #4540

@skaegi
Contributor

skaegi commented Jun 22, 2022

We're running into a problem when memory "limits" are used in our Kubernetes pods: they work with runc but fail when run with Kata. Kata looks at the memory "limit" and tries to hotplug the sum of all container memory limits into the guest. This works fine unless that sum exceeds the host's actual maximum memory, in which case we get an error like this...

Events:
  Type     Reason     Age                From               Message
  ----     ------     ----               ----               -------
  Normal   Created    11s (x2 over 17s)  kubelet            Created container memory-boom
  Normal   Pulled     11s                kubelet            Successfully pulled image "polinux/stress" in 380.789628ms
  Warning  Failed     17s (x2 over 23s)  kubelet            Error: failed to create containerd task: failed to create shim: Unable to hotplug 1024000 MiB memory, the SB has 4096 MiB and the maximum amount is 31753 MiB: unknow
  Warning  BackOff    4s (x2 over 5s)    kubelet            Back-off restarting failed container

Here's an example Pod to illustrate (this Pod works in standard runc) ...

apiVersion: v1
kind: Pod
metadata:
  name: memory-boom-untrusted
spec:
  runtimeClassName: kata
  containers:
  - name: memory-boom
    image: polinux/stress
    resources:
      requests:
        memory: "50Mi"
        cpu: "50m"
      limits:
        memory: "1000Gi"
        cpu: "500"
    command: ["stress"]
    args: ["--vm", "1", "--vm-bytes", "250M", "--vm-hang", "1"]

The above Pod is contrived; however, we are hitting this problem in our hosted solution.

For CPU limits, the hotplugged vCPUs do not exceed the host's. Can we have the same support for memory, so that we do not try to hotplug more memory than the host has?

Less critical, but it would be great if this support could be generalized into something like default_maxmemory, analogous to the current default_maxvcpus support in our runtime configuration.
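
A minimal sketch of what that could look like on the configuration side, assuming a hypothetical Go struct; field names and the default_maxmemory key are illustrative, not the actual Kata runtime code:

// Illustrative only: a default_maxmemory knob sitting next to the existing
// default_maxvcpus one. Only default_vcpus/default_maxvcpus/default_memory
// exist today; everything else here is an assumption.
package config

type HypervisorConfig struct {
	// These already exist today.
	DefaultVCPUs    uint32 `toml:"default_vcpus"`    // initial vCPUs
	DefaultMaxVCPUs uint32 `toml:"default_maxvcpus"` // hotplug ceiling for vCPUs

	// default_memory (MiB) already exists; default_maxmemory is the
	// proposed knob, with 0 meaning "cap at the host's physical RAM".
	DefaultMemory    uint64 `toml:"default_memory"`
	DefaultMaxMemory uint64 `toml:"default_maxmemory"`
}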

@skaegi skaegi added enhancement Improvement to an existing feature needs-review Needs to be assessed by the team. labels Jun 22, 2022
@skaegi
Contributor Author

skaegi commented Jun 22, 2022

I didn't label this issue as a "bug", but perhaps it should be one, since these high-memory-limit pods work with runc.

@liubin
Member

liubin commented Jun 23, 2022

This is an interesting feature.
/cc @kata-containers/architecture-committee

@fidencio
Member

@skaegi, thinking out loud here: what should the behaviour be in that case?

I'm afraid hot-plugging all the memory we have (even if less than requested) wouldn't actually fly, as it could lead to resource starvation.

Let's say we use the suggested default_maxmemory: would this prevent the pod from starting, or would it hotplug the default_maxmemory amount?

That's indeed an interesting bug to solve.

@skaegi
Contributor Author

skaegi commented Jun 23, 2022

I definitely do not want a pod to be prevented from starting because of a limit. As a user, I don't want to play a game where a container has to guess how to set its limits based on how much memory the host has.

Kata is hot-plugging the memory space, not reserving it -- i.e. we are already, and very much on purpose, running over-committed, which is how Kubernetes is always run, so this is not a resource-starvation concern. If a user is very conservative and wants to minimize the chance of the OOM killer coming into play as the host runs low on memory, they can make sure the resource 'request' is equal to the 'limit'. We do that today in some testing circumstances and it works well.

Off the top of my head the behavior I think makes the most sense is...

  1. If a non-zero default_maxmemory is present, we should hotplug up to that amount, but not higher than the host's memory.
  2. If default_maxmemory is zero, we should hotplug up to the host's memory space.

This is identical to how CPU hotplugging works, so it's not novel. In either case we should not prevent the pod from starting because of a limit.
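
A minimal sketch of that clamping behaviour, assuming illustrative names (requestedMiB, defaultMaxMemoryMiB, hostMemoryMiB) rather than the actual Kata runtime code:

// Sketch of the proposed memory clamping, mirroring what is already done for vCPUs.
package main

import "fmt"

// maxHotpluggableMiB returns the ceiling for guest memory: the configured
// default_maxmemory when non-zero, otherwise the host's physical RAM, and
// never more than the host's physical RAM in either case.
func maxHotpluggableMiB(defaultMaxMemoryMiB, hostMemoryMiB uint64) uint64 {
	if defaultMaxMemoryMiB == 0 || defaultMaxMemoryMiB > hostMemoryMiB {
		return hostMemoryMiB
	}
	return defaultMaxMemoryMiB
}

// clampHotplug caps the amount requested via pod limits instead of failing the pod.
func clampHotplug(requestedMiB, defaultMaxMemoryMiB, hostMemoryMiB uint64) uint64 {
	if limit := maxHotpluggableMiB(defaultMaxMemoryMiB, hostMemoryMiB); requestedMiB > limit {
		return limit
	}
	return requestedMiB
}

func main() {
	// The case from the issue: a 1000Gi limit (1024000 MiB) on a host with
	// ~31753 MiB of RAM gets clamped instead of rejected.
	fmt.Println(clampHotplug(1024000, 0, 31753)) // prints 31753
}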

@skaegi
Contributor Author

skaegi commented Jun 23, 2022

I'm on board to provide tests for this feature, and if you think it's simple, I could try coding it too, with some guidance.

@fidencio
Member

I will cook up something for that no later than next week.

@fidencio fidencio self-assigned this Jun 24, 2022
fidencio added a commit to fidencio/kata-containers that referenced this issue Jun 27, 2022
Let's add a `default_maxmemory` configuration, which allows the admins
to set the maximum amount of memory to be used by a VM, considering the
initial amount + whatever ends up being hotplugged via the pod limits.

By default this value is 0 (zero), and it means that the whole physical
RAM is the limit.

While implementing this, a change of behaviour (or a bug fix, depending
on how you see it) has been introduced: if a pod requests more memory
than the amount available on the host, instead of failing to start the
pod, we simply hotplug the maximum amount of memory available, better
mimicking the runc behaviour.

Fixes: kata-containers#4516

Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
fidencio added a commit to fidencio/kata-containers that referenced this issue Jun 27, 2022
Let's adapt Cloud Hypervisor's and QEMU's code to behave properly with
the newly added `default_maxmemory` config.

While implementing this, a change of behaviour (or a bug fix, depending
on how you see it) has been introduced: if a pod requests more memory
than the amount available on the host, instead of failing to start the
pod, we simply hotplug the maximum amount of memory available, better
mimicking the runc behaviour.

Fixes: kata-containers#4516

Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
fidencio added a commit to fidencio/kata-containers that referenced this issue Jun 27, 2022
Expose the newly added `default_maxmemory` to the project's Makefile and
to the configuration files.

Fixes: kata-containers#4516

Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
@katacontainersbot katacontainersbot moved this from To do to In progress in Issue backlog Jun 27, 2022
@fidencio fidencio linked a pull request Jun 28, 2022 that will close this issue
@egernst
Member

egernst commented Jun 29, 2022

How would you set 'maxmemory' in your case, @skaegi? I'm wondering how you'd manage this in a cluster that has a mix of node configurations.

Issue backlog automation moved this from In progress to Done Jun 30, 2022
@katacontainersbot katacontainersbot moved this from Done to In progress in Issue backlog Jul 6, 2022
@skaegi
Contributor Author

skaegi commented Jul 14, 2022

@egernst I'm using this to create a set of runtime classes with pre-configured sizes like 1 (CPU) x 4 (GB memory), 2x8, 4x16, etc. Each runtime class in turn uses a different base micro-VM config, with resource sizes where the default and max CPU/memory are set to the same value.

We've found our users are just terrible at setting individual resource requirements but are decent at picking a box size. The net result is that the bin packing is definitely not as good, but the user experience is simpler and I have something much easier to use as a charging basis.

egernst pushed a commit to egernst/kata-containers that referenced this issue Jul 21, 2023