support for static configuration of k8s cpu manager - container level cpu affinity #878
Original logic would not allow a container to be updated unless it was in the running state. A container should be able to be updated if it is either running, or created and in the ready state. This better matches the logic in CRI-O today, and allows for the use case where a created container is updated because the K8s CPU manager adjusts the CPU affinity of a created but not yet running container. Fixes kata-containers#878 Signed-off-by: Eric Ernst <eric.ernst@intel.com>
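A minimal sketch of the relaxed state check described in that commit message, using hypothetical state names (the runtime's actual types and fields may differ):

```go
package sketch

// ContainerState is a hypothetical stand-in for the runtime's lifecycle states.
type ContainerState string

const (
	StateReady   ContainerState = "ready"   // created but not yet started
	StateRunning ContainerState = "running" // started
)

// canUpdateContainer allows updates while running, or while created and ready,
// so the K8s CPU manager can adjust the cpuset of a not-yet-started container.
func canUpdateContainer(s ContainerState) bool {
	return s == StateRunning || s == StateReady
}
```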
@egernst regarding the mandatory requirement:
Do you mean
Temporarily disabled `cpuManagerPolicy=static`, see the issue below: kata-containers/runtime#878. Provided a kata-equivalent yaml for the cpumanager test. Signed-off-by: Saikrishna Edupuganti <saikrishna.edupuganti@intel.com>
/cc @devimc has this been addressed?
The cpuset is honored at the host level for the vCPU threads, but the iothreads will still be jumping around; they are not pinned to dedicated CPUs.
@jcvenegas what will
This is not consistent today. The guest

Example:

Case I

The VM has

Case II

The

The vCPU-to-container mapping is still required in order to produce runc-like output.
@jcvenegas In the second case, I think kata-runtime should take action and adjust the guest process cpuset accordingly. Namely:
Added a sub-task related to the cpumanager static policy.
Hey guys, I've got a chance to look at this again. Per the kubernetes CPU manager policy, IMO we need to handle
With a static kubernetes CPU manager policy, node CPU resources are actually divided into four parts:
To achieve this, when the k8s static CPU manager policy is enabled, every container should have a cpuset cgroup constraint, otherwise it will compete with the dedicated-CPU containers. (This needs confirmation, but I think it is the only way this can work.) So I propose a global design principle for the question of where to put the kata components in the CPU cgroups: put them in the shared CPU pool. Following this principle, I propose the following design to integrate kata with the k8s static CPU manager:
wdyt? @egernst @WeiZhang555 @sboeuf @kata-containers/architecture-committee
@bergwolf I think it makes sense, I like the idea. But it sounds soooooooo complicated, now the picture is a mess in my mind though I can understand each detailed item 🤣
@WeiZhang555 The idea is to let kata components compete with containers in the shared CPU pool. All the steps are there to make it happen. Let me put up some slides and bring it up in the next AC meeting.
@bergwolf using the shared pool for QEMU will cause problems. When you perform auto-scaling, it is done based on the actual resource usage of the pod. If the QEMU overhead is not accounted for properly, we will not scale correctly. Also, if the user needs a pod that has the same performance as a runc-based pod, we should request a higher amount of resources. This will help size the resource/performance requirement when running with Kata.
@mcastelino I would expect to only use the same cpuset as the shared CPU pool for kata components, but not to use any container's CPU quota. It means that QEMU processes will compete with containers on the same node for the shared pCPUs, with no guaranteed CPU shares. As for auto-scaling and scheduling, sandbox overhead needs to be accounted for at the kubernetes level; it is not something we can solve in kata containers. Right now it is not counted, and I don't see my proposal changing any aspect of the current situation.
No, we don't need to request more for kata components. We just make sure kata components do not use containers' CPU quota. Then containers get what they ask for, and kata components are left to compete freely for the remaining shared CPU resources.
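As a rough illustration of the "shared CPU pool" placement being discussed (not the actual kata implementation), a per-sandbox cgroup v1 cpuset could hold the kata components and be limited to the shared pool. The cgroup path, the hard-coded mems value, and how the shared pool is discovered are all assumptions:

```go
package sketch

import (
	"fmt"
	"os"
	"path/filepath"
)

// constrainToSharedPool places the sandbox's kata components (hypervisor, shim,
// etc.) into a cpuset cgroup limited to the shared CPU pool, so they never run
// on CPUs dedicated to exclusive containers. A cgroup v1 layout is assumed.
func constrainToSharedPool(sandboxID, sharedCPUs string, pids []int) error {
	dir := filepath.Join("/sys/fs/cgroup/cpuset/kata", sandboxID)
	if err := os.MkdirAll(dir, 0o755); err != nil {
		return err
	}
	// A child cpuset needs cpus and mems populated before tasks can join it.
	if err := os.WriteFile(filepath.Join(dir, "cpuset.cpus"), []byte(sharedCPUs), 0o644); err != nil {
		return err
	}
	if err := os.WriteFile(filepath.Join(dir, "cpuset.mems"), []byte("0"), 0o644); err != nil {
		return err
	}
	for _, pid := range pids {
		if err := os.WriteFile(filepath.Join(dir, "tasks"), []byte(fmt.Sprintf("%d", pid)), 0o644); err != nil {
			return err
		}
	}
	return nil
}
```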
@bergwolf would it mean, or be possible, that a burstable pod's overall performance suffers, since its iothreads are not receiving CPU quota guarantees?
@krsna1729 Yes, it is possible. The situation is the same with runc, in which case kernel iothreads are not getting a CPU guarantee either. The reason (for it being ignored right now) is that CPU is not usually in the critical path for IO performance. However, I agree that we need further tuning for very high speed IO devices if we want to outperform runc -- we need a way to allocate CPU quota or even a cpuset for iothreads without hurting container performance. But it is really a special case, and different devices may require different handling (I'm thinking of dpdk, spdk, nvme, all kinds of vhost-user-[net,blk,fs] etc.). So I prefer to make the general case work first.
@bergwolf @krsna1729 @egernst @jcvenegas @devimc I wanted to completely capture the behavior of k8s as it exists today w.r.t. cgroups (cpu, cpuset, memory). It is a bit too long to put down here, but it is captured in a gist: https://gist.github.com/mcastelino/b8ce9a70b00ee56036dadd70ded53e9f Please let me know if any of these observations are incorrect, or if you have a way to set up system isolation in a manner in which you get the behavior described in #878 (comment).
A quick summary
This should hopefully result in a properly informed, more productive discussion.
Description of problem
We only think in terms of vCPUs today, that is, shares/quota.
Expected result
We need to look at CPU affinity, and in particular the Cpus field:
https://github.com/kubernetes-sigs/cri-o/blob/master/vendor/github.com/opencontainers/runtime-spec/specs-go/config.go#L304
In a default kubernetes configuration, this mask should be set to include all CPUs. However, if the cpu-manager is configured as static, it is possible to start setting CPU affinities on a best-effort basis at container granularity. With this in place, you'll see specific masks unique to each container.
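For reference, the field in question comes from the OCI runtime-spec linked above; a small sketch of how a runtime could read it from the container spec (the helper name is hypothetical):

```go
package sketch

import (
	specs "github.com/opencontainers/runtime-spec/specs-go"
)

// containerCpuset returns the CPU affinity mask requested for a container
// (e.g. "2-3,7"), or "" when no cpuset was set, which is the usual case
// without the static CPU manager policy.
func containerCpuset(spec *specs.Spec) string {
	if spec.Linux == nil || spec.Linux.Resources == nil || spec.Linux.Resources.CPU == nil {
		return ""
	}
	return spec.Linux.Resources.CPU.Cpus
}
```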
Actual result
Today, if a user were to set up a mixed cluster with runc and kata, the kata runtime ignores the CPU set passed in, resulting in the vCPU (and vhost) threads running across all available CPUs (no CPU isolation is in place; affinity is managed by kubelet itself). This would result in kata-based containers not only not getting a performance-tuned affinity, but also likely utilizing CPUs which kubelet wanted dedicated.
Proposal
Mandatory:
The following would need to be done in order to make sure we aren't utilizing CPUs dedicated to other pods.
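The concrete steps aren't spelled out here, but one way the mandatory part could look, sketched under the assumption that the runtime can discover the hypervisor's vCPU and vhost thread IDs (e.g. via QMP or /proc), is to apply the host cpuset from the spec to those threads:

```go
package sketch

import "golang.org/x/sys/unix"

// pinThreadsToCpuset applies the host cpuset handed to us in the container
// spec (already parsed into a list of CPU ids) to the given thread IDs, so the
// sandbox never runs on CPUs the kubelet has dedicated to other pods. How the
// vCPU and vhost thread IDs are discovered is out of scope for this sketch.
func pinThreadsToCpuset(tids []int, cpus []int) error {
	var set unix.CPUSet
	set.Zero()
	for _, c := range cpus {
		set.Set(c)
	}
	for _, tid := range tids {
		if err := unix.SchedSetaffinity(tid, &set); err != nil {
			return err
		}
	}
	return nil
}
```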
Optimally, but secondary compared to the first set of changes:
With the mandatory bits in place, we'll be using the CPU set provided, but we won't be providing CPU affinity on a per-container basis. To fully support CPU affinity in K8S, we'd also need to:
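One possible shape of that per-container piece, assuming vCPU threads are pinned 1:1 to the host CPUs in the container's mask and that the runtime keeps a host-CPU-to-vCPU mapping (both assumptions), would be to translate the host cpuset into a guest cpuset that the agent can apply inside the VM:

```go
package sketch

import (
	"sort"
	"strconv"
	"strings"
)

// guestCpusetFor translates a host cpuset (as a list of host CPU ids) into the
// matching guest cpuset string, using a hypothetical host-CPU -> vCPU mapping
// maintained by the runtime.
func guestCpusetFor(hostCpus []int, hostToVCPU map[int]int) string {
	vcpus := make([]int, 0, len(hostCpus))
	for _, h := range hostCpus {
		if v, ok := hostToVCPU[h]; ok {
			vcpus = append(vcpus, v)
		}
	}
	sort.Ints(vcpus)
	parts := make([]string, len(vcpus))
	for i, v := range vcpus {
		parts[i] = strconv.Itoa(v)
	}
	return strings.Join(parts, ",") // e.g. "0,1"
}
```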