This repository has been archived by the owner on May 12, 2021. It is now read-only.

support for static configuration of k8s cpu manager - container level cpu affinity #878

Closed
6 tasks
egernst opened this issue Nov 2, 2018 · 16 comments
Labels
feature New functionality

Comments

@egernst
Member

egernst commented Nov 2, 2018

Description of problem

Today we only think in terms of vCPUs, that is, shares/quota.

Expected result

We need to look at CPU affinity, and in particular the Cpus field:

https://github.com/kubernetes-sigs/cri-o/blob/master/vendor/github.com/opencontainers/runtime-spec/specs-go/config.go#L304

In a default kubernetes configuration, this mask should be set for all CPUs. However, if the cpu-manager is configured as static, it is possible to start setting CPU affinities on a best effort basis on a container granularity. With this in place, you'll see specific masks unique to each container.

Actual result

Today, if a user were to set up a mixed cluster with runc and Kata, the Kata runtime ignores the CPU set passed in, resulting in the vCPU (and vhost) threads running across all available CPUs (no isol is in place - affinity is managed by the kubelet itself). As a result, Kata-based containers not only miss out on a performance-tuned affinity, but are also likely to utilize CPUs which the kubelet wanted dedicated to other pods.

Proposal

Mandatory:

The following would need to be done in order to make sure we aren't utilizing CPUs dedicated to other pods.

  • Augment virtcontainers to track the Cpus field provided as part of UpdateContainer's runtime spec. The sandbox-level cpuset mask, which would be the logical OR of all container cpuset masks, would be utilized to constrain vCPU threads (and perhaps vhost threads).
  • Track the PID(s) associated with the sandbox's QEMU, and taskset them based on the OR of each individual container's cpuset mask. The same would be needed for vhost and iothreads.
  • Update this sandbox-level mask each time an UpdateContainer call is made which includes an updated cpuset mask.
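The sandbox-level mask computation described above (the logical OR of all container cpuset masks) can be sketched roughly as follows. `parseCPUSet` and `sandboxCPUSet` are illustrative names for this sketch, not existing virtcontainers APIs:

```go
package main

import (
	"fmt"
	"sort"
	"strconv"
	"strings"
)

// parseCPUSet expands a cpuset string such as "0,2-4" into a set of CPU ids.
func parseCPUSet(s string) map[int]bool {
	cpus := map[int]bool{}
	for _, part := range strings.Split(s, ",") {
		if part == "" {
			continue
		}
		if b := strings.SplitN(part, "-", 2); len(b) == 2 {
			lo, _ := strconv.Atoi(b[0])
			hi, _ := strconv.Atoi(b[1])
			for i := lo; i <= hi; i++ {
				cpus[i] = true
			}
		} else {
			i, _ := strconv.Atoi(part)
			cpus[i] = true
		}
	}
	return cpus
}

// sandboxCPUSet returns the union ("or") of all container cpuset masks,
// formatted as a cpuset string suitable for taskset or cpuset.cpus.
func sandboxCPUSet(containerSets []string) string {
	union := map[int]bool{}
	for _, s := range containerSets {
		for cpu := range parseCPUSet(s) {
			union[cpu] = true
		}
	}
	ids := make([]int, 0, len(union))
	for cpu := range union {
		ids = append(ids, cpu)
	}
	sort.Ints(ids)
	parts := make([]string, len(ids))
	for i, cpu := range ids {
		parts[i] = strconv.Itoa(cpu)
	}
	return strings.Join(parts, ",")
}

func main() {
	// Two containers pinned to CPUs 1 and 2-3; the sandbox mask is their union.
	fmt.Println(sandboxCPUSet([]string{"1", "2-3"})) // 1,2,3
}
```

This mask would be recomputed on every UpdateContainer call that carries a new cpuset, then applied to the QEMU vCPU (and possibly vhost) threads.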

Optimally, but secondary compared to the first set of changes:

With the mandatory bits in place, we'll be using the CPU set provided, but we won't be providing CPU affinity on a per container basis. To fully support CPU affinity in K8S, we'd also need to:

  • Track the mapping of physical CPUs to vCPUs inside the guest.
  • Pin container processes to a particular vCPU set based on the CPU mapping.
  • Look into documenting how to allocate more CPUs in the system pool for running non-container vCPUs associated with Kata (i.e., run vhost threads and/or shim processes on a particular CPU set as well).
@egernst egernst added the feature New functionality label Nov 2, 2018
@egernst egernst changed the title support for static configuration of k8s cpu manager - contains level cpu affinity support for static configuration of k8s cpu manager - container level cpu affinity Nov 2, 2018
egernst pushed a commit to egernst/runtime that referenced this issue Nov 2, 2018
Original logic would not allow a container to be updated if it wasn't
in the running state. A container should be able to be updated if it
is either running or is created and in ready state.

This better matches the logic in CRI-O today, and allows for the use
case where a created container is updated due to the K8s CPU manager
adjusting the CPU affinity of a created but not yet running container.

Fixes kata-containers#878

Signed-off-by: Eric Ernst <eric.ernst@intel.com>
@sboeuf

sboeuf commented Nov 2, 2018

@egernst regarding the mandatory requirement:

Reset this mask each time an UpdateContainers call is made which includes an updated CPUset mask.

Do you mean update when you say reset? Just want to make sure it's clear that every time a new UpdateContainers call occurs, the mask should be updated.

krsna1729 added a commit to clearlinux/cloud-native-setup that referenced this issue Jan 3, 2019
Temporarily disabled `cpuManagerPolicy=static`, issue below
kata-containers/runtime#878

Provided kata equivalent yaml for cpumanager test

Signed-off-by: Saikrishna Edupuganti <saikrishna.edupuganti@intel.com>
mcastelino pushed a commit to clearlinux/cloud-native-setup that referenced this issue Jan 3, 2019
Temporarily disabled `cpuManagerPolicy=static`, issue below
kata-containers/runtime#878

Provided kata equivalent yaml for cpumanager test

Signed-off-by: Saikrishna Edupuganti <saikrishna.edupuganti@intel.com>
@jcvenegas jcvenegas self-assigned this Jan 3, 2019
@mcastelino
Contributor

/cc @devimc has this been addressed?

@jcvenegas
Member

@mcastelino

Cpusets are honored at the host level for vCPUs, but iothreads are still jumping around rather than being pinned to dedicated CPUs.
Today it is possible to run static-policy workloads on Kata and K8s, but the following optimizations are missing:

  • CPU pinning of vCPUs to CPUs
  • vCPU mapping between CPUs and vCPUs

@mcastelino
Contributor

mcastelino commented Mar 19, 2019

@jcvenegas what will nproc inside the Kata container show?

@jcvenegas
Member

This is not consistent today.
Inside the Kata container, nproc will report the number of vCPUs added to the cpuset cgroup; in most cases it will report the number of vCPUs in the VM.
Why?

The guest cpuset cgroup is only applied when the requested cpuset is a subset of the guest's vCPUs.

Example:

Case I

+------------------------------------------------------+
|     ContainerA|                    Container B       |
|      Cpuset: 1|                    cpuset: 2-3       |
|               |                                      |
|               |                                      |
|               |                                      |
+------------------------------------------------------+
|   cpu 1       | cpu 2            |  cpu 3            |
+------------------------------------------------------+
|                                                      |
|                 VM                                   |
+------------------------------------------------------+

The VM has vCPUs 1-3 online.
There is a pod definition with 2 containers:
A: cpuset: 1
B: cpuset: 2-3
The nproc output will be 1 for container A and 2 for container B.

Case II

Case I is only valid for the first guaranteed pod that is created. If a similar pod is created again, the static manager will assign a different set of CPUs that will not match inside the guest; in this case the agent will ignore the cpuset request, and the nproc output will be the number of vCPUs in the guest (3 for the case below).

+------------------------------------------------------+
|     ContainerA|                    Container B       |
|      Cpuset: 4|                    cpuset: 5-6       |
|               |                                      |
|               |                                      |
|               |                                      |
+------------------------------------------------------+
|   cpu 1       | cpu 2            |  cpu 3            |
+------------------------------------------------------+
|                                                      |
|                 VM                                   |
+------------------------------------------------------+

The vCPU-to-container mapping is still required to get runc-like output.
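The agent behavior in the two cases can be sketched roughly like this. The function names are illustrative, not the actual kata-agent code:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// expand turns a cpuset string such as "2-3" into individual CPU ids.
func expand(s string) []int {
	var cpus []int
	for _, part := range strings.Split(s, ",") {
		if b := strings.SplitN(part, "-", 2); len(b) == 2 {
			lo, _ := strconv.Atoi(b[0])
			hi, _ := strconv.Atoi(b[1])
			for i := lo; i <= hi; i++ {
				cpus = append(cpus, i)
			}
		} else if part != "" {
			i, _ := strconv.Atoi(part)
			cpus = append(cpus, i)
		}
	}
	return cpus
}

// applyGuestCPUSet mimics the described behavior: the requested cpuset is
// applied inside the guest only when every CPU in it is an online vCPU;
// otherwise the request is ignored and nproc reports all guest vCPUs.
func applyGuestCPUSet(requested string, onlineVCPUs []int) (nproc int, applied bool) {
	online := map[int]bool{}
	for _, v := range onlineVCPUs {
		online[v] = true
	}
	req := expand(requested)
	for _, c := range req {
		if !online[c] {
			return len(onlineVCPUs), false // Case II: request ignored
		}
	}
	return len(req), true // Case I: cpuset applied in the guest
}

func main() {
	// Case I: cpusets 1 and 2-3 are subsets of guest vCPUs 1-3.
	fmt.Println(applyGuestCPUSet("1", []int{1, 2, 3}))   // 1 true
	fmt.Println(applyGuestCPUSet("2-3", []int{1, 2, 3})) // 2 true
	// Case II: cpuset 5-6 does not exist in the guest, so it is ignored.
	fmt.Println(applyGuestCPUSet("5-6", []int{1, 2, 3})) // 3 false
}
```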

@bergwolf
Member

@jcvenegas In the second case, I think kata-runtime should take action and adjust the guest process cpuset accordingly. Namely:

  1. Pin QEMU vCPU threads to host CPUs.
  2. Maintain an internal mapping to mark that vCPUs 1-3 are actually mapped to host CPUs 4-6.
  3. Adjust the cpuset in the container spec to honor the vCPU-to-CPU mapping before sending the spec to kata-agent.
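The spec adjustment in the last step could look roughly like this sketch, which translates a host cpuset through the runtime's internal pCPU-to-vCPU mapping. `translateCPUSet` is a hypothetical helper name:

```go
package main

import (
	"fmt"
	"sort"
	"strconv"
	"strings"
)

// translateCPUSet rewrites a host cpuset (e.g. "4-6") into guest vCPU
// numbering, given the internal pCPU -> vCPU mapping, before the container
// spec is sent to the agent.
func translateCPUSet(hostSet string, pcpuToVCPU map[int]int) (string, error) {
	var vcpus []int
	for _, part := range strings.Split(hostSet, ",") {
		lo, hi := 0, 0
		if b := strings.SplitN(part, "-", 2); len(b) == 2 {
			lo, _ = strconv.Atoi(b[0])
			hi, _ = strconv.Atoi(b[1])
		} else {
			lo, _ = strconv.Atoi(part)
			hi = lo
		}
		for p := lo; p <= hi; p++ {
			v, ok := pcpuToVCPU[p]
			if !ok {
				return "", fmt.Errorf("host CPU %d is not mapped to any vCPU", p)
			}
			vcpus = append(vcpus, v)
		}
	}
	sort.Ints(vcpus)
	parts := make([]string, len(vcpus))
	for i, v := range vcpus {
		parts[i] = strconv.Itoa(v)
	}
	return strings.Join(parts, ","), nil
}

func main() {
	// vCPUs 1-3 are pinned to host CPUs 4-6, so a host cpuset of "4-6"
	// becomes "1,2,3" inside the guest.
	mapping := map[int]int{4: 1, 5: 2, 6: 3}
	guestSet, _ := translateCPUSet("4-6", mapping)
	fmt.Println(guestSet) // 1,2,3
}
```

With this in place, the Case II example above ("cpuset: 4" and "cpuset: 5-6") would be rewritten into vCPU terms the guest can actually honor.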

@jcvenegas
Member

Added a sub-task related to the CPU manager static policy.

#1430

@bergwolf
Member

Hey guys, I've got a chance to look at this again. Per the kubernetes CPU manager policy, IMO we need to handle BestEffort/Burstable vs. Guaranteed differently. For containers running in the shared CPU pool, we let them compete with kata components as well. For guaranteed containers, we need to make sure nothing competes with them, even in the guest.

With a static kubernetes CPU manager policy, node CPU resources are actually divided into four parts:

  • system reserved: reserved for all other system programs
  • kubelet reserved: reserved for kubelet process
  • shared CPUs pool: all CPU resource minus system reserved and kubelet reserved
  • dedicated CPUs: taken from shared CPUs pool whenever requested

To achieve it, when the k8s static CPU manager policy is enabled, every container should have a cpuset cgroup constraint, otherwise it will compete with dedicated-CPU containers. (Needs confirmation, but I think it is the only way this can work.)

So I propose a global design principle/choice to the question of where to put the kata components in cpu cgroups: put it in the shared CPU pool.

Following such a principle, I propose the following design to integrate kata with k8s static CPU manager:

  1. Host-side kata components (kata-shim, kata-proxy, qemu main thread, qemu io threads etc.) use the shared CPU pool, but do not consume container CPU quota. IOW, they have the same cpuset as the shared CPU pool, but in a different cpu cgroup so that they do not share container CPU quota. It will require k8s to set a cpuset for the very first pause container as well; if it doesn't, let's fix it in crio/containerd.
  2. Guest-side kata components (guest kernel threads, kata-agent, systemd-related stuff) use the shared CPU pool. This can be achieved by starting the guest with a default number of vCPUs and putting them in the same cpuset cgroup as the host-side kata components.
  3. Containers with shared CPU pool constraints (cpuset + fractional CPU quota): hotplug the required vCPUs to the guest, add these vCPU threads to the container cpu cgroup (cpuset + quota).
  4. Containers with dedicated CPU constraints (cpuset + integer-quantity CPU quota): hotplug the required vCPUs to the guest, add these vCPU threads to the container cpu cgroup (cpuset + quota).
  5. Before creating containers in the guest, modify their specs to reflect the vCPU/CPU mapping change.
  6. Allow updating a container's cgroup cpuset while it is running, as k8s will ask to refresh it when host CPUs are taken from or added to the shared CPU pool.
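One way to read the placement rules above is as a simple classifier: vCPU threads backing a container land in that container's cpuset+quota cgroup, while everything else kata-related lives in a separate shared-pool cgroup so it never consumes container quota. This is only a sketch of the proposal; the cgroup names below are purely illustrative, not from the kata codebase:

```go
package main

import "fmt"

// component classifies the pieces of a kata sandbox under the proposed design.
type component int

const (
	hostKataComponent   component = iota // kata-shim, kata-proxy, qemu main/io threads
	guestKataComponent                   // guest kernel threads, kata-agent, systemd
	containerVCPUThread                  // vCPU threads backing a specific container
)

// cgroupFor returns the CPU cgroup a component would join: kata components
// share the shared-pool cpuset but sit in their own cgroup (no container
// quota consumed), while container vCPU threads join the container's
// cpuset+quota cgroup.
func cgroupFor(c component, containerID string) string {
	switch c {
	case containerVCPUThread:
		return "container/" + containerID // container cpuset + quota (rules 3 and 4)
	default:
		return "kata-shared-pool" // shared-pool cpuset, separate quota group (rules 1 and 2)
	}
}

func main() {
	fmt.Println(cgroupFor(hostKataComponent, ""))         // kata-shared-pool
	fmt.Println(cgroupFor(guestKataComponent, ""))        // kata-shared-pool
	fmt.Println(cgroupFor(containerVCPUThread, "abc123")) // container/abc123
}
```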

wdyt? @egernst @WeiZhang555 @sboeuf @kata-containers/architecture-committee

@WeiZhang555
Member

@bergwolf I think it makes sense, I like the idea.

But it sounds soooooooo complicated, now the picture is a mess in my mind though I can understand each detailed item 🤣

@bergwolf
Member

@WeiZhang555 The idea is to let kata components compete with containers in the shared CPU pool. All the steps are there to make it happen. Let me put up some slides and bring it up in the next AC meeting.

@sboeuf

sboeuf commented Mar 28, 2019

@bergwolf 👍

@mcastelino
Contributor

@bergwolf using the shared pool for QEMU will cause problems. When you perform auto scaling, it is done based on the actual resource usage of the pod. If the QEMU overhead is not accounted for properly we will not scale correctly.

Also, if the user needs a pod that has the same performance as a runc-based pod, we should request a higher amount of resources. This will help size the resource/performance requirement when running with Kata.

@bergwolf
Member

@mcastelino I would expect kata components to only use the same cpuset as the shared CPU pool, but not use any container's CPU quota. It means that QEMU processes will compete with containers on the same node for shared pCPUs, with no guaranteed CPU shares.

As for auto-scaling scheduling, sandbox overhead needs to be counted at the kubernetes level and it is not something we can solve in kata containers. Right now it is not counted. And I don't see my proposal changing any aspect of the current situation.

Also if the user needs a pod that has the same performance as a runc based pod, we should request higher amount of resources.

No, we don't need to request more for kata components. We just make sure kata components do not use containers' CPU quota. Then containers get what they ask for, and kata components are left to compete freely for the remaining shared CPU resources.

@krsna1729

@bergwolf would it mean/be possible that a burstable pod's overall performance suffers, since its iothreads are not receiving CPU quota guarantees?

@bergwolf
Member

@krsna1729 Yes, it is possible. The situation is the same with runc, in which case kernel iothreads are not getting cpu guarantee either. The reason (for it being ignored right now) is that CPU is not usually in the critical path for IO performance.

However, I agree that we need further tuning for very high speed IO devices if we want to outperform runc -- we need a way to allocate CPU quota or even a cpuset for iothreads without hurting container performance. But it is really a special case, and different devices may require different handling (I'm thinking of dpdk, spdk, nvme, all kinds of vhost-user-[net,blk,fs] etc.). So I prefer to make the general case work first.

@mcastelino
Contributor

@bergwolf @krsna1729 @egernst @jcvenegas @devimc I wanted to completely capture the behavior of k8s as it exists today w.r.t. cgroups (cpu, cpuset, memory). It is a bit too long to put down here, so it is captured in a gist:

https://gist.github.com/mcastelino/b8ce9a70b00ee56036dadd70ded53e9f

Please let me know if any of these observations are incorrect. Or if you have a way to setup system isolation in a manner in which you get behavior described in #878 (comment)

A quick summary

  • cpusets today are split into only two pools (guaranteed and everything else)
  • kube and system overheads are not captured for cpusets and are not in either of these pools
  • Assured scheduling of kube and system components is managed only via CPU shares
  • Total memory limits are imposed on the kubepods
  • Total memory limits are not imposed on the system and kube components

This should hopefully result in a properly informed, more productive discussion.

@jodh-intel jodh-intel added this to To do in Issue backlog Aug 10, 2020
Issue backlog automation moved this from To do to Done Apr 7, 2021