This repository has been archived by the owner on May 12, 2021. It is now read-only.

support for static configuration of k8s cpu manager - container level cpu affinity #878

Closed
6 tasks
egernst opened this issue Nov 2, 2018 · 16 comments
Labels
feature New functionality

Comments

@egernst
Member

egernst commented Nov 2, 2018

Description of problem

Today we only think in terms of vCPUs, that is, shares/quota.

Expected result

We need to look at CPU affinity, and in particular the Cpus field:

https://github.com/kubernetes-sigs/cri-o/blob/master/vendor/github.com/opencontainers/runtime-spec/specs-go/config.go#L304

In a default kubernetes configuration, this mask should be set for all CPUs. However, if the cpu-manager is configured as static, it is possible to start setting CPU affinities on a best effort basis on a container granularity. With this in place, you'll see specific masks unique to each container.

Actual result

Today, if a user were to set up a mixed cluster with runc and Kata, the Kata runtime ignores the CPU set passed in, resulting in the vCPU (and vhost) threads running across all available CPUs (no isol is in place - affinity is managed by the kubelet itself). As a result, Kata-based containers not only miss out on a performance-tuned affinity, but are also likely to utilize CPUs which the kubelet wanted dedicated to other pods.

Proposal

Mandatory:

The following would need to be done in order to make sure we aren't utilizing CPUs dedicated to other pods.

  • Augment virtcontainers to track the Cpus field provided as part of UpdateContainer's runtime spec. The sandbox-level cpuset mask, which would be the logical OR of all container cpuset masks, would be utilized to constrain vCPU threads (and perhaps vhost threads).
  • Track the PID(s) associated with the sandbox's QEMU, and taskset them based on the OR of each individual container's cpuset mask. The same would be needed for vhost and iothreads.
  • Update this sandbox-level mask each time an UpdateContainer call is made which includes an updated cpuset mask.
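The sandbox-level mask computation described above (the logical OR of all container cpuset masks) can be sketched roughly as follows. `parseCPUSet` and `sandboxCPUSet` are illustrative names for this sketch, not existing virtcontainers APIs:

```go
package main

import (
	"fmt"
	"sort"
	"strconv"
	"strings"
)

// parseCPUSet expands a cpuset string such as "0,2-4" into a set of CPU ids.
func parseCPUSet(s string) map[int]bool {
	cpus := map[int]bool{}
	for _, part := range strings.Split(s, ",") {
		if part == "" {
			continue
		}
		if b := strings.SplitN(part, "-", 2); len(b) == 2 {
			lo, _ := strconv.Atoi(b[0])
			hi, _ := strconv.Atoi(b[1])
			for i := lo; i <= hi; i++ {
				cpus[i] = true
			}
		} else {
			i, _ := strconv.Atoi(part)
			cpus[i] = true
		}
	}
	return cpus
}

// sandboxCPUSet returns the union ("or") of all container cpuset masks,
// formatted as a cpuset string suitable for taskset or cpuset.cpus.
func sandboxCPUSet(containerSets []string) string {
	union := map[int]bool{}
	for _, s := range containerSets {
		for cpu := range parseCPUSet(s) {
			union[cpu] = true
		}
	}
	ids := make([]int, 0, len(union))
	for cpu := range union {
		ids = append(ids, cpu)
	}
	sort.Ints(ids)
	parts := make([]string, len(ids))
	for i, cpu := range ids {
		parts[i] = strconv.Itoa(cpu)
	}
	return strings.Join(parts, ",")
}

func main() {
	// Two containers pinned to CPUs 1 and 2-3; the sandbox mask is their union.
	fmt.Println(sandboxCPUSet([]string{"1", "2-3"})) // 1,2,3
}
```

This mask would be recomputed on every UpdateContainer call that carries a new cpuset, then applied to the QEMU vCPU (and possibly vhost) threads.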

Optimally, but secondary compared to the first set of changes:

With the mandatory bits in place, we'll be using the CPU set provided, but we won't be providing CPU affinity on a per container basis. To fully support CPU affinity in K8S, we'd also need to:

  • Track the mapping of physical CPUs to vCPUs inside the guest.
  • Pin container processes to a particular vCPU set based on the CPU mapping.
  • Look into documenting how to allocate more CPUs in the system pool for running non-container vCPUs associated with Kata (i.e., run vhost threads and/or shim processes on a particular CPU set as well).
@egernst egernst added the feature New functionality label Nov 2, 2018
@egernst egernst changed the title support for static configuration of k8s cpu manager - contains level cpu affinity support for static configuration of k8s cpu manager - container level cpu affinity Nov 2, 2018
egernst pushed a commit to egernst/runtime that referenced this issue Nov 2, 2018
Original logic would not allow a container to be updated if it wasn't
in the running state. A container should be able to be updated if it
is either running or is created and in ready state.

This better matches the logic in CRI-O today, and allows for the use
case where a created container is updated due to the K8s CPU manager
adjusting the CPU affinity of a created but not yet running container.

Fixes kata-containers#878

Signed-off-by: Eric Ernst <eric.ernst@intel.com>
@sboeuf

sboeuf commented Nov 2, 2018

@egernst regarding the mandatory requirement:

Reset this mask each time an UpdateContainers call is made which includes an updated CPUset mask.

Do you mean update when you say reset? Just want to make sure it's clear that every time a new UpdateContainers call occurs, the mask should be updated.

krsna1729 added a commit to clearlinux/cloud-native-setup that referenced this issue Jan 3, 2019
Temporarily disabled `cpuManagerPolicy=static`, issue below
kata-containers/runtime#878

Provided kata equivalent yaml for cpumanager test

Signed-off-by: Saikrishna Edupuganti <saikrishna.edupuganti@intel.com>
mcastelino pushed a commit to clearlinux/cloud-native-setup that referenced this issue Jan 3, 2019
Temporarily disabled `cpuManagerPolicy=static`, issue below
kata-containers/runtime#878

Provided kata equivalent yaml for cpumanager test

Signed-off-by: Saikrishna Edupuganti <saikrishna.edupuganti@intel.com>
@jcvenegas jcvenegas self-assigned this Jan 3, 2019
@mcastelino
Contributor

/cc @devimc has this been addressed?

@jcvenegas
Member

@mcastelino

Cpusets are honored at the host level for vCPUs, but iothreads are still jumping around rather than being pinned to dedicated CPUs.
Today it is possible to run static-policy workloads on Kata and K8s, but the following optimizations are missing:

  • CPU pinning of vCPUs to CPUs
  • vCPU mapping between CPUs and vCPUs

@mcastelino
Contributor

mcastelino commented Mar 19, 2019

@jcvenegas what will nproc inside the Kata container show?

@jcvenegas
Member

This is not consistent today.
Inside the Kata container, nproc will report the number of vCPUs added to the cpuset cgroup; in most cases it will report the number of vCPUs in the VM.
Why?

The guest cpuset cgroup is only applied when the requested cpuset is a subset of the guest's vCPUs.

Example:

Case I

+------------------------------------------------------+
|     ContainerA|                    Container B       |
|      Cpuset: 1|                    cpuset: 2-3       |
|               |                                      |
|               |                                      |
|               |                                      |
+------------------------------------------------------+
|   cpu 1       | cpu 2            |  cpu 3            |
+------------------------------------------------------+
|                                                      |
|                 VM                                   |
+------------------------------------------------------+

The VM has vCPUs 1-3 online.
There is a pod definition with 2 containers:
A: cpuset: 1
B: cpuset: 2-3
The nproc output will be 1 for container A and 2 for container B.

Case II

Case I is only valid for the first guaranteed pod that is created. If a similar pod is created again, the static manager will assign a different set of CPUs that will not match inside the guest; in this case the agent will ignore the cpuset request, and the nproc output will be the number of vCPUs in the guest (3 for the case below).

+------------------------------------------------------+
|     ContainerA|                    Container B       |
|      Cpuset: 4|                    cpuset: 5-6       |
|               |                                      |
|               |                                      |
|               |                                      |
+------------------------------------------------------+
|   cpu 1       | cpu 2            |  cpu 3            |
+------------------------------------------------------+
|                                                      |
|                 VM                                   |
+------------------------------------------------------+

The vCPU-to-container mapping is still required to get runc-like output.
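The agent behavior in the two cases can be sketched roughly like this. The function names are illustrative, not the actual kata-agent code:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// expand turns a cpuset string such as "2-3" into individual CPU ids.
func expand(s string) []int {
	var cpus []int
	for _, part := range strings.Split(s, ",") {
		if b := strings.SplitN(part, "-", 2); len(b) == 2 {
			lo, _ := strconv.Atoi(b[0])
			hi, _ := strconv.Atoi(b[1])
			for i := lo; i <= hi; i++ {
				cpus = append(cpus, i)
			}
		} else if part != "" {
			i, _ := strconv.Atoi(part)
			cpus = append(cpus, i)
		}
	}
	return cpus
}

// applyGuestCPUSet mimics the described behavior: the requested cpuset is
// applied inside the guest only when every CPU in it is an online vCPU;
// otherwise the request is ignored and nproc reports all guest vCPUs.
func applyGuestCPUSet(requested string, onlineVCPUs []int) (nproc int, applied bool) {
	online := map[int]bool{}
	for _, v := range onlineVCPUs {
		online[v] = true
	}
	req := expand(requested)
	for _, c := range req {
		if !online[c] {
			return len(onlineVCPUs), false // Case II: request ignored
		}
	}
	return len(req), true // Case I: cpuset applied in the guest
}

func main() {
	// Case I: cpusets 1 and 2-3 are subsets of guest vCPUs 1-3.
	fmt.Println(applyGuestCPUSet("1", []int{1, 2, 3}))   // 1 true
	fmt.Println(applyGuestCPUSet("2-3", []int{1, 2, 3})) // 2 true
	// Case II: cpuset 5-6 does not exist in the guest, so it is ignored.
	fmt.Println(applyGuestCPUSet("5-6", []int{1, 2, 3})) // 3 false
}
```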

@bergwolf
Member

@jcvenegas In the second case, I think kata-runtime should take action and adjust the guest process cpuset accordingly. Namely:

  1. Pin QEMU vCPU threads to host CPUs.
  2. Maintain an internal mapping to mark that vCPUs 1-3 are actually mapped to host CPUs 4-6.
  3. Adjust the cpuset in the container spec to honor the vCPU-to-CPU mapping before sending the spec to kata-agent.
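The spec adjustment in the last step could look roughly like this sketch, which translates a host cpuset through the runtime's internal pCPU-to-vCPU mapping. `translateCPUSet` is a hypothetical helper name:

```go
package main

import (
	"fmt"
	"sort"
	"strconv"
	"strings"
)

// translateCPUSet rewrites a host cpuset (e.g. "4-6") into guest vCPU
// numbering, given the internal pCPU -> vCPU mapping, before the container
// spec is sent to the agent.
func translateCPUSet(hostSet string, pcpuToVCPU map[int]int) (string, error) {
	var vcpus []int
	for _, part := range strings.Split(hostSet, ",") {
		lo, hi := 0, 0
		if b := strings.SplitN(part, "-", 2); len(b) == 2 {
			lo, _ = strconv.Atoi(b[0])
			hi, _ = strconv.Atoi(b[1])
		} else {
			lo, _ = strconv.Atoi(part)
			hi = lo
		}
		for p := lo; p <= hi; p++ {
			v, ok := pcpuToVCPU[p]
			if !ok {
				return "", fmt.Errorf("host CPU %d is not mapped to any vCPU", p)
			}
			vcpus = append(vcpus, v)
		}
	}
	sort.Ints(vcpus)
	parts := make([]string, len(vcpus))
	for i, v := range vcpus {
		parts[i] = strconv.Itoa(v)
	}
	return strings.Join(parts, ","), nil
}

func main() {
	// vCPUs 1-3 are pinned to host CPUs 4-6, so a host cpuset of "4-6"
	// becomes "1,2,3" inside the guest.
	mapping := map[int]int{4: 1, 5: 2, 6: 3}
	guestSet, _ := translateCPUSet("4-6", mapping)
	fmt.Println(guestSet) // 1,2,3
}
```

With this in place, the Case II example above ("cpuset: 4" and "cpuset: 5-6") would be rewritten into vCPU terms the guest can actually honor.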

@jcvenegas
Member

Added a sub-task related to the CPU manager static policy.

#1430

@bergwolf
Member

Hey guys, I've got a chance to look at this again. Per the kubernetes CPU manager policy, IMO we need to handle BestEffort/Burstable vs. Guaranteed differently. For containers running in the shared CPU pool, we let them compete with kata components as well. For guaranteed containers, we need to make sure nothing competes with them, even in the guest.

With a static kubernetes CPU manager policy, node CPU resources are actually divided into four parts:

  • system reserved: reserved for all other system programs
  • kubelet reserved: reserved for kubelet process
  • shared CPUs pool: all CPU resource minus system reserved and kubelet reserved
  • dedicated CPUs: taken from shared CPUs pool whenever requested

To achieve it, when the k8s static CPU manager policy is enabled, every container should have a cpuset cgroup constraint, otherwise it will compete with dedicated-CPU containers. (Needs confirmation, but I think it is the only way this can work.)

So I propose a global design principle/choice to the question of where to put the kata components in cpu cgroups: put it in the shared CPU pool.

Following such a principle, I propose the following design to integrate kata with k8s static CPU manager:

  1. Host-side kata components (kata-shim, kata-proxy, qemu main thread, qemu io threads etc.) use the shared CPU pool, but do not consume container CPU quota. IOW, they have the same cpuset as the shared CPU pool, but in a different cpu cgroup so that they do not share container CPU quota. It will require k8s to set a cpuset for the very first pause container as well; if it doesn't, let's fix it in crio/containerd.
  2. Guest-side kata components (guest kernel threads, kata-agent, systemd-related stuff) use the shared CPU pool. This can be achieved by starting the guest with a default number of vCPUs and putting them in the same cpuset cgroup as the host-side kata components.
  3. Containers with shared CPU pool constraints (cpuset + fractional CPU quota): hotplug the required vCPUs to the guest, add these vCPU threads to the container cpu cgroup (cpuset + quota).
  4. Containers with dedicated CPU constraints (cpuset + integer-quantity CPU quota): hotplug the required vCPUs to the guest, add these vCPU threads to the container cpu cgroup (cpuset + quota).
  5. Before creating containers in the guest, modify their specs to reflect the vCPU/CPU mapping change.
  6. Allow updating a container's cgroup cpuset while it is running, as k8s will ask to refresh it when host CPUs are taken from or added to the shared CPU pool.
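One way to read the placement rules above is as a simple classifier: vCPU threads backing a container land in that container's cpuset+quota cgroup, while everything else kata-related lives in a separate shared-pool cgroup so it never consumes container quota. This is only a sketch of the proposal; the cgroup names below are purely illustrative, not from the kata codebase:

```go
package main

import "fmt"

// component classifies the pieces of a kata sandbox under the proposed design.
type component int

const (
	hostKataComponent   component = iota // kata-shim, kata-proxy, qemu main/io threads
	guestKataComponent                   // guest kernel threads, kata-agent, systemd
	containerVCPUThread                  // vCPU threads backing a specific container
)

// cgroupFor returns the CPU cgroup a component would join: kata components
// share the shared-pool cpuset but sit in their own cgroup (no container
// quota consumed), while container vCPU threads join the container's
// cpuset+quota cgroup.
func cgroupFor(c component, containerID string) string {
	switch c {
	case containerVCPUThread:
		return "container/" + containerID // container cpuset + quota (rules 3 and 4)
	default:
		return "kata-shared-pool" // shared-pool cpuset, separate quota group (rules 1 and 2)
	}
}

func main() {
	fmt.Println(cgroupFor(hostKataComponent, ""))         // kata-shared-pool
	fmt.Println(cgroupFor(guestKataComponent, ""))        // kata-shared-pool
	fmt.Println(cgroupFor(containerVCPUThread, "abc123")) // container/abc123
}
```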

wdyt? @egernst @WeiZhang555 @sboeuf @kata-containers/architecture-committee

@WeiZhang555
Member

@bergwolf I think it makes sense, I like the idea.

But it sounds soooooooo complicated, now the picture is a mess in my mind though I can understand each detailed item 🤣

@bergwolf
Member

@WeiZhang555 The idea is to let kata components compete with containers in the shared CPU pool. All the steps are there to make it happen. Let me put up some slides and bring it up in the next AC meeting.

@sboeuf

sboeuf commented Mar 28, 2019

@bergwolf 👍

@mcastelino
Contributor

@bergwolf using the shared pool for QEMU will cause problems. When you perform auto scaling, it is done based on the actual resource usage of the pod. If the QEMU overhead is not accounted for properly we will not scale correctly.

Also, if the user needs a pod that has the same performance as a runc-based pod, we should request a higher amount of resources. This will help size the resource/performance requirement when running with Kata.

@bergwolf
Member

@mcastelino I would expect kata components to only use the same cpuset as the shared CPU pool, but not use any container's CPU quota. It means that QEMU processes will compete with containers on the same node for shared pCPUs, with no guaranteed CPU shares.

As for auto-scaling scheduling, sandbox overhead needs to be counted at the kubernetes level and it is not something we can solve in kata containers. Right now it is not counted. And I don't see my proposal changing any aspect of the current situation.

Also if the user needs a pod that has the same performance as a runc based pod, we should request higher amount of resources.

No, we don't need to request more for kata components. We just make sure kata components do not use containers' CPU quota. Then containers get what they ask for, and kata components are left to compete freely for the remaining shared CPU resources.

@krsna1729

@bergwolf would it mean/be possible that a burstable pod's overall performance suffers, since its iothreads are not receiving CPU quota guarantees?

@bergwolf
Member

@krsna1729 Yes, it is possible. The situation is the same with runc, in which case kernel iothreads are not getting cpu guarantee either. The reason (for it being ignored right now) is that CPU is not usually in the critical path for IO performance.

However, I agree that we need further tuning for very high speed IO devices if we want to outperform runc -- we need a way to allocate CPU quota or even a cpuset for iothreads without hurting container performance. But it is really a special case, and different devices may require different handling (I'm thinking of dpdk, spdk, nvme, all kinds of vhost-user-[net,blk,fs] etc.). So I prefer to make the general case work first.

@mcastelino
Contributor

@bergwolf @krsna1729 @egernst @jcvenegas @devimc I wanted to completely capture the behavior of k8s as it exists today w.r.t. cgroups (cpu, cpuset, memory). It is a bit too long to put down here, so it is captured in a gist:

https://gist.github.com/mcastelino/b8ce9a70b00ee56036dadd70ded53e9f

Please let me know if any of these observations are incorrect. Or if you have a way to setup system isolation in a manner in which you get behavior described in #878 (comment)

A quick summary

  • cpusets today are split into only two pools (guaranteed and everything else)
  • kube and system overheads are not captured for cpusets and are not in either of these pools
  • Assured scheduling of kube and system components is managed only via CPU shares
  • Total memory limits are imposed on the kubepods
  • Total memory limits are not imposed on the system and kube components

This should hopefully result in a properly informed, more productive discussion.

@jodh-intel jodh-intel added this to To do in Issue backlog Aug 10, 2020
Issue backlog automation moved this from To do to Done Apr 7, 2021