cAdvisor only sees pause container scope on cgroup v2 + systemd-cgroup (subcontainer compat dirs not created) #13067

@a7i

Description

On a host running unified cgroup v2 with systemd-cgroup = "true" in runsc.toml,
cAdvisor only reports per-scope metrics for the pause container's cgroup scope.
Application and sidecar containers in the same pod have no cri-containerd-*.scope
directory under the pod slice, so cAdvisor's inotify watcher never discovers them
and /metrics/cadvisor is missing CPU / memory / network series for them.

The kubelet's CRI-backed /stats/summary endpoint reports the values fine, so the
data is available -- it just isn't surfaced through cAdvisor.

This looks like the cgroup v2 + systemd analogue of #6500, which was fixed for v1
non-systemd in #6657 by creating empty subcontainer cgroup directories so cAdvisor
could detect them. Tracing the current code:

  • runsc/container/container.go::setupCgroupForSubcontainer calls cgroupInstall(conf, cg, &specs.LinuxResources{}).
  • For systemd v2, the underlying cgroupSystemd.Install() (in runsc/cgroup/systemd.go)
    only stages dbus properties; the systemd transient scope unit (and therefore the
    cgroup directory on the host) is only created in Join() -- which is not called
    for these compat-only subcontainer cgroups.

Net effect: with systemd-cgroup = "true" on a v2 host, the empty subcontainer
cgroup directory that would let cAdvisor discover the container is never created,
so per-container cAdvisor metrics regress to "pause-only" for every gVisor pod.

This regresses anything keyed off cAdvisor: kubectl top pod, container-level
CPU/memory dashboards, VPA recommendations driven from cAdvisor series, etc.

Expected

For a pod with N user containers running under runsc + systemd-cgroup = "true"
on a cgroup v2 host, cAdvisor should expose per-container series for each user
container (matching what runc exposes today and what kubelet's /stats/summary
already reports via CRI for the same pod). The fix should be the equivalent of
#6657, but on the systemd v2 code path -- e.g. ensure setupCgroupForSubcontainer
on systemd v2 actually materializes an empty cgroup directory under the pod
slice that cAdvisor can inotify-watch.

Actual

Only the pause container's scope is visible to cAdvisor; user containers are
absent from /metrics/cadvisor, and the per-container cgroup directories
(memory.current, memory.stat, memory.max, cpu.stat, ...) do not exist
under the pod slice on the host.

Steps to reproduce

Cluster: cgroup v2 unified host (stat -fc %T /sys/fs/cgroup -> cgroup2fs),
containerd 2.x, kubernetes 1.35, runsc registered as RuntimeClass: gvisor,
runsc.toml:

[runsc_config]
  net-raw = "true"
  systemd-cgroup = "true"

Pod: runtimeClassName: gvisor, two containers (sidecar istio-proxy +
application container).

$ kubectl get pods -n <namespace> -l role=<app> -owide
NAME                READY   STATUS    NODE
<app>-...-rls5m     2/2     Running   <node>

$ kubectl get pods -n <namespace> -l role=<app> -oyaml | grep runtimeClass
    runtimeClassName: gvisor

cAdvisor (kubelet /metrics/cadvisor)

Total container_memory_working_set_bytes series on the node: 346.
For the gVisor pod, only 2 series, both at the pod-slice / pause-scope level:

$ kubectl get --raw "/api/v1/nodes/<node>/proxy/metrics/cadvisor" \
    | grep container_memory_working_set_bytes | grep <app>
container_memory_working_set_bytes{container="",
  id="/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod<uid>.slice",
  image="",name="",namespace="<namespace>",pod="<app>-...-rls5m"} 2.268966912e+09 ...
container_memory_working_set_bytes{container="",
  id="/kubepods.slice/.../cri-containerd-<pause-id>.scope",
  image="localhost/kubernetes/pause:latest",
  name="<pause-id>",namespace="<namespace>",pod="<app>-...-rls5m"} 2.268966912e+09 ...

No series for container="istio-proxy" or container="<app>".
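One way to quantify the gap directly from the scrape (a throwaway helper, not
part of any tool; metric and pod names as above):

```shell
# Count working-set series for one pod, split by whether the `container`
# label is populated. On the affected node the "named" count is 0 even
# though the pod runs two user containers.
count_series() {
  # $1 = metrics dump file, $2 = pod name substring
  grep 'container_memory_working_set_bytes' "$1" | grep "pod=\"$2" |
    awk '{ if ($0 ~ /container=""/) empty++; else named++ }
         END { printf "empty-label=%d named=%d\n", empty+0, named+0 }'
}

# kubectl get --raw "/api/v1/nodes/<node>/proxy/metrics/cadvisor" > cadvisor.txt
# count_series cadvisor.txt "<app>"
```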

kubelet /stats/summary (CRI-backed) for the same pod

Both containers show real CPU/memory:

{
  "name": "istio-proxy",
  "cpu":    { "usageNanoCores": 18618234, "usageCoreNanoSeconds": 305217528000 },
  "memory": { "workingSetBytes": 146890752, "usageBytes": 146890752 }
}
{
  "name": "<app>",
  "cpu":    { "usageNanoCores": 19078357, "usageCoreNanoSeconds": 328723623000 },
  "memory": { "workingSetBytes": 2868473856, "usageBytes": 2868473856 }
}

So the data exists; it's just not exposed in the cAdvisor cgroup tree.

Host filesystem

Under /sys/fs/cgroup/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod<uid>.slice/:

  • only the pause container's cri-containerd-<pause-id>.scope/ exists
  • no cri-containerd-*.scope/ directory for the app or sidecar container
  • consequently, no memory.current / memory.stat / memory.max for those
    containers on the host

For comparison, a runc pod on the same node has one scope per user container
under the pod slice with the usual v2 files populated.
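The host-side comparison can be scripted (assumes the unified hierarchy is
mounted at /sys/fs/cgroup; the slice path below is a placeholder):

```shell
# List the per-container scope directories directly under a pod slice.
# For a runc pod this prints one cri-containerd-*.scope per container
# (pause included); for the affected gVisor pod it prints only the
# pause scope. Uses GNU find's -printf.
list_scopes() {
  # $1 = absolute path to the pod slice directory
  find "$1" -maxdepth 1 -type d -name 'cri-containerd-*.scope' \
    -printf '%f\n' | sort
}

# list_scopes /sys/fs/cgroup/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod<uid>.slice
```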

Environment

runsc version

runsc version release-20260302.0
spec: 1.1.0-rc.1

(Repros identically on stock release; we run a local build that cherry-picks
#12686 and #12688 on top of release-20260316.0 for Istio DNS capture, but
neither patch touches cgroup code.)

uname

Linux ... 6.x ... x86_64 GNU/Linux (Ubuntu 22.04, EKS worker)

kubectl

Client Version: v1.35.x
Server Version: v1.35.x

containerd

containerd 2.2.2, runtime registered as:

[plugins.'io.containerd.cri.v1.runtime'.containerd.runtimes.runsc]
  runtime_type = "io.containerd.runsc.v1"
[plugins.'io.containerd.cri.v1.runtime'.containerd.runtimes.runsc.options]
  TypeUrl    = "io.containerd.runsc.v1.options"
  ConfigPath = "/etc/containerd/runsc.toml"
