Description
After upgrading gVisor from release 20210322.0 to 20210720, we noticed that we were missing some cAdvisor Prometheus container metrics (e.g. `container_network_receive_packets_total`, `container_network_receive_bytes_total`) for all pods/containers running on gVisor.
During our debugging we tried to nail down the gVisor change that caused this issue. We did this by going through the gVisor changelog while also running multiple versions of gVisor to find the release where metrics stopped working. We found that release 20210518.0 was when metrics stopped being collected; 20210510.0 was the last release from which we successfully received metrics via cAdvisor. Looking at the changelog between the two versions, we came across the following commit, where the sandbox was updated to use the pod cgroup instead of the (first) container's cgroup it had been using. Given that cAdvisor is built to deliver per-container metrics, but the mentioned commit moves the sandbox (and its metrics) into the pod's cgroup, cAdvisor stops emitting these metrics for all gVisor containers.
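For reference, this placement can be confirmed directly on the node. A quick sketch (the `SANDBOX_PID` placeholder stands for the `pid` field from the sandbox config shown below; paths assume cgroup v1):

```sh
# Show which cgroups the sandbox process belongs to.
cat /proc/$SANDBOX_PID/cgroup
# On 20210518.0 and later, the cpu/memory/net_* lines point at the pod-level
# path /kubepods/burstable/pod<uid>, with no per-container suffix for cAdvisor
# to attribute stats to.
```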
Comparing the sandbox config for my pod between the working and broken versions of gVisor, the only changes that stood out were the cgroup changes. Examples of the sandbox config:
working
"sandbox": {
"id": "e7b82bbf3ee2901aaec811ec59d807edaf6c6967884103ac418cfa3b032481da",
"pid": 342475,
"cgroup": {
"name": "/kubepods/burstable/pod308d8927-7e25-443c-8dad-d390f7023b0e/e7b82bbf3ee2901aaec811ec59d807edaf6c6967884103ac418cfa3b032481da",
"parents": null,
"own": {
"blkio": true,
"cpu": true,
"cpuset": true,
"devices": true,
"freezer": true,
"hugetlb": true,
"memory": true,
"net_prio": true,
"perf_event": true,
"pids": true,
"rdma": true,
"systemd": true
}
},
"originalOomScoreAdj": -999
}
not working
"sandbox": {
"id": "71e675350a4c07b79cae482e650c1beeb1e23488112d3b751838a9dc8e17399a",
"pid": 359425,
"cgroup": {
"name": "/kubepods/burstable/pod2baf1dc0-3715-4153-a8c2-b46225e219cc",
"parents": null,
"own": {
"devices": true,
"rdma": true
}
},
"originalOomScoreAdj": -999
}
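The same difference is visible in the host cgroup hierarchy. A sketch, assuming cgroup v1 and the pod UIDs/container IDs from the configs above:

```sh
# Working: the sandbox process lives in its own per-container cgroup under the pod.
cat /sys/fs/cgroup/cpu/kubepods/burstable/pod308d8927-7e25-443c-8dad-d390f7023b0e/e7b82bbf3ee2901aaec811ec59d807edaf6c6967884103ac418cfa3b032481da/cgroup.procs

# Not working: the sandbox process sits directly in the pod-level cgroup.
cat /sys/fs/cgroup/cpu/kubepods/burstable/pod2baf1dc0-3715-4153-a8c2-b46225e219cc/cgroup.procs
```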
Steps to reproduce
In order to reproduce the bug, you need a running k8s cluster with cAdvisor metrics enabled as part of the kubelet. We rely on Prometheus to scrape the metrics, but to test this issue it is enough to run a curl command against the kubelet on the node where the gVisor pod is running. These steps assume you are able to authenticate requests against the kubelet on that node; one way to obtain a suitable token is sketched below.
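A sketch for k8s 1.21, where service account tokens are still auto-created as secrets; the service account name `kubelet-probe` is hypothetical:

```sh
# Create a service account and bind it to the built-in kubelet API admin role.
kubectl create serviceaccount kubelet-probe
kubectl create clusterrolebinding kubelet-probe \
  --clusterrole=system:kubelet-api-admin \
  --serviceaccount=default:kubelet-probe

# Extract the bearer token from the service account's auto-created secret.
SECRET=$(kubectl get serviceaccount kubelet-probe -o jsonpath='{.secrets[0].name}')
TOKEN=$(kubectl get secret "$SECRET" -o jsonpath='{.data.token}' | base64 -d)
```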
- Run the 20210518.0 or newer release of the gVisor `containerd` shim and `runsc`
- Deploy a pod using gVisor on k8s that is able to receive and/or respond to network calls. I was utilizing sample-golang-notes (a minimal stand-in manifest is sketched after this list)
- Run GET requests against your running app
- Run `curl -H "Authorization: Bearer $TOKEN" -k https://KUBELET_IP/metrics/cadvisor` and search/grep for `container_network_receive_bytes_total` or `container_network_receive_packets_total` metrics associated with your namespace (note the double quotes around the header: with single quotes the shell would never expand `$TOKEN`)
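A minimal stand-in for the test workload (a sketch; assumes a RuntimeClass named `gvisor` is installed on the cluster, and any HTTP server image works in place of sample-golang-notes; nginx is used here purely for illustration):

```sh
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: gvisor-metrics-test
spec:
  runtimeClassName: gvisor
  containers:
  - name: web
    image: nginx
    ports:
    - containerPort: 80
EOF

# Generate some receive traffic so the packet/byte counters have something to count.
kubectl port-forward pod/gvisor-metrics-test 8080:80 &
for i in $(seq 1 50); do curl -s http://localhost:8080/ > /dev/null; done
kill %1
```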
You should see the mentioned metrics missing from the response (an illustrative before/after is sketched below). If you downgrade your gVisor release to 20210510.0 or below and follow the same steps, the mentioned cAdvisor metrics should start coming through again.
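For illustration only (labels and values are made up, not actual output; 10250 is the kubelet's default secure port):

```sh
curl -sk -H "Authorization: Bearer $TOKEN" https://KUBELET_IP:10250/metrics/cadvisor \
  | grep container_network_receive_bytes_total
# On 20210510.0 or earlier, expect per-container series along these lines:
#   container_network_receive_bytes_total{namespace="default",pod="gvisor-metrics-test",...} 51234
# On 20210518.0 or later, the same grep returns nothing for gVisor pods.
```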
runsc version
```
> runsc --version
runsc version release-20210518.0
spec: 1.0.2
```
docker version (if using docker)
No response
uname
```
Linux NODE 4.19.0-17-amd64 #1 SMP Debian 4.19.194-3 (2021-07-18) x86_64 GNU/Linux
```
kubectl (if using Kubernetes)
```
❯ kubectl version
Client Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.2", GitCommit:"092fbfbf53427de67cac1e9fa54aaa09a28371d7", GitTreeState:"clean", BuildDate:"2021-06-16T12:52:14Z", GoVersion:"go1.16.5", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.2", GitCommit:"092fbfbf53427de67cac1e9fa54aaa09a28371d7", GitTreeState:"clean", BuildDate:"2021-06-16T12:53:14Z", GoVersion:"go1.16.5", Compiler:"gc", Platform:"linux/amd64"}
```
repo state (if built from source)
No response
runsc debug logs (if available)
No response