Description
On a host running unified cgroup v2 with systemd-cgroup = "true" in runsc.toml,
cAdvisor only reports per-scope metrics for the pause container's cgroup scope.
Application and sidecar containers in the same pod have no cri-containerd-*.scope
directory under the pod slice, so cAdvisor's inotify watcher never discovers them
and /metrics/cadvisor is missing CPU / memory / network series for them.
The kubelet's CRI-backed /stats/summary endpoint reports the values fine, so the
data is available -- it just isn't surfaced through cAdvisor.
This looks like the cgroup v2 + systemd analogue of #6500, which was fixed for v1
non-systemd in #6657 by creating empty subcontainer cgroup directories so cAdvisor
could detect them. Tracing the current code:
- runsc/container/container.go::setupCgroupForSubcontainer calls cgroupInstall(conf, cg, &specs.LinuxResources{}).
- For systemd v2, the underlying cgroupSystemd.Install() (in runsc/cgroup/systemd.go) only stages dbus properties; the systemd transient scope unit (and therefore the cgroup directory on the host) is only created in Join() -- which is not called for these compat-only subcontainer cgroups.
Net effect: with systemd-cgroup = "true" on a v2 host, the empty subcontainer
cgroup directory that would let cAdvisor discover the container is never created,
so per-container cAdvisor metrics regress to "pause-only" for every gVisor pod.
This regresses anything keyed off cAdvisor: kubectl top pod, container-level
CPU/memory dashboards, VPA recommendations driven from cAdvisor series, etc.
Expected
For a pod with N user containers running under runsc + systemd-cgroup = "true"
on a cgroup v2 host, cAdvisor should expose per-container series for each user
container (matching what runc exposes today and what kubelet's /stats/summary
already reports via CRI for the same pod). Equivalent fix path to #6657, but
working on the systemd v2 code path -- e.g. ensure setupCgroupForSubcontainer
on systemd v2 actually materializes an empty cgroup directory under the pod slice
that cAdvisor can inotify-watch.
Actual
Only the pause container's scope is visible to cAdvisor; user containers are
absent from /metrics/cadvisor, and the per-container cgroup directories
(memory.current, memory.stat, memory.max, cpu.stat, ...) do not exist
under the pod slice on the host.
Steps to reproduce
Cluster: cgroup v2 unified host (stat -fc %T /sys/fs/cgroup -> cgroup2fs),
containerd 2.x, Kubernetes 1.35, runsc registered as RuntimeClass: gvisor,
runsc.toml:
```toml
[runsc_config]
net-raw = "true"
systemd-cgroup = "true"
```
Pod: runtimeClassName: gvisor, two containers (sidecar istio-proxy +
application container).
```console
$ kubectl get pods -n <namespace> -l role=<app> -owide
NAME              READY   STATUS    NODE
<app>-...-rls5m   2/2     Running   <node>
$ kubectl get pods -n <namespace> -l role=<app> -oyaml | grep runtimeClass
    runtimeClassName: gvisor
```
cAdvisor (kubelet /metrics/cadvisor)
Total container_memory_working_set_bytes series on the node: 346.
For the gVisor pod, only 2 series, both at the pod-slice / pause-scope level:
```console
$ kubectl get --raw "/api/v1/nodes/<node>/proxy/metrics/cadvisor" \
    | grep container_memory_working_set_bytes | grep <app>
container_memory_working_set_bytes{container="",
  id="/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod<uid>.slice",
  image="",name="",namespace="<namespace>",pod="<app>-...-rls5m"} 2.268966912e+09 ...
container_memory_working_set_bytes{container="",
  id="/kubepods.slice/.../cri-containerd-<pause-id>.scope",
  image="localhost/kubernetes/pause:latest",
  name="<pause-id>",namespace="<namespace>",pod="<app>-...-rls5m"} 2.268966912e+09 ...
```
No series for container="istio-proxy" or container="<app>".
kubelet /stats/summary (CRI-backed) for the same pod
Both containers show real CPU/memory:
```json
{
  "name": "istio-proxy",
  "cpu": { "usageNanoCores": 18618234, "usageCoreNanoSeconds": 305217528000 },
  "memory": { "workingSetBytes": 146890752, "usageBytes": 146890752 }
}
{
  "name": "<app>",
  "cpu": { "usageNanoCores": 19078357, "usageCoreNanoSeconds": 328723623000 },
  "memory": { "workingSetBytes": 2868473856, "usageBytes": 2868473856 }
}
```
So the data exists; it's just not exposed in the cAdvisor cgroup tree.
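For a scripted comparison, the per-container numbers can be pulled straight out of the Summary API payload. A minimal sketch (the struct fields follow the kubelet Summary API JSON; the embedded document is abbreviated sample data, not real node output):

```go
package main

import (
	"encoding/json"
	"fmt"
)

// summary is a minimal subset of the kubelet Summary API (/stats/summary)
// needed to list per-container working-set memory.
type summary struct {
	Pods []struct {
		PodRef struct {
			Name string `json:"name"`
		} `json:"podRef"`
		Containers []struct {
			Name   string `json:"name"`
			Memory struct {
				WorkingSetBytes uint64 `json:"workingSetBytes"`
			} `json:"memory"`
		} `json:"containers"`
	} `json:"pods"`
}

// workingSets maps "pod/container" to workingSetBytes as reported via CRI.
func workingSets(data []byte) (map[string]uint64, error) {
	var s summary
	if err := json.Unmarshal(data, &s); err != nil {
		return nil, err
	}
	out := map[string]uint64{}
	for _, p := range s.Pods {
		for _, c := range p.Containers {
			out[p.PodRef.Name+"/"+c.Name] = c.Memory.WorkingSetBytes
		}
	}
	return out, nil
}

func main() {
	doc := []byte(`{"pods":[{"podRef":{"name":"app-rls5m"},"containers":[
	 {"name":"istio-proxy","memory":{"workingSetBytes":146890752}},
	 {"name":"app","memory":{"workingSetBytes":2868473856}}]}]}`)
	ws, err := workingSets(doc)
	if err != nil {
		panic(err)
	}
	fmt.Println(ws["app-rls5m/istio-proxy"], ws["app-rls5m/app"])
}
```

Diffing this map against the container labels present in /metrics/cadvisor makes the missing series obvious: every key here that has no matching container="..." series is a container cAdvisor never discovered.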
Host filesystem
Under /kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod<uid>.slice/:
- only the pause container's cri-containerd-<pause-id>.scope/ exists
- no cri-containerd-*.scope/ directory for the app or sidecar container
- consequently, no memory.current / memory.stat / memory.max for those containers on the host
For comparison, a runc pod on the same node has one scope per user container
under the pod slice with the usual v2 files populated.
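A quick way to spot an affected pod from the host is to count the scope directories cAdvisor would watch; a small sketch (the path in main is illustrative, substitute the real pod UID):

```go
package main

import (
	"fmt"
	"path/filepath"
)

// listScopes returns the cri-containerd-*.scope directories under a pod
// slice -- exactly the directories cAdvisor's inotify watcher would discover.
// On a healthy runc pod this yields one entry per container (pause included);
// on an affected gVisor pod only the pause scope shows up.
func listScopes(podSliceDir string) ([]string, error) {
	return filepath.Glob(filepath.Join(podSliceDir, "cri-containerd-*.scope"))
}

func main() {
	// Illustrative path; replace podUID with the pod's actual UID on the node.
	scopes, err := listScopes("/sys/fs/cgroup/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-podUID.slice")
	if err != nil {
		panic(err)
	}
	for _, s := range scopes {
		fmt.Println(s)
	}
	fmt.Println("total scopes:", len(scopes))
}
```

For a 2-container pod plus pause, anything less than 3 scopes means per-container cAdvisor series will be missing.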
Environment
runsc version
```
runsc version release-20260302.0
spec: 1.1.0-rc.1
```
(Repros identically on stock release; we run a local build that cherry-picks
#12686 and #12688 on top of release-20260316.0 for Istio DNS capture, but
neither patch touches cgroup code.)
uname
Linux ... 6.x ... x86_64 GNU/Linux (Ubuntu 22.04, EKS worker)
kubectl
```
Client Version: v1.35.x
Server Version: v1.35.x
```
containerd
containerd 2.2.2, runtime registered as:
```toml
[plugins.'io.containerd.cri.v1.runtime'.containerd.runtimes.runsc]
  runtime_type = "io.containerd.runsc.v1"
[plugins.'io.containerd.cri.v1.runtime'.containerd.runtimes.runsc.options]
  TypeUrl = "io.containerd.runsc.v1.options"
  ConfigPath = "/etc/containerd/runsc.toml"
```