
Kubelet spamming 'Unable to fetch pod log stats' log messages after running cronjobs #106957

Closed
rpthms opened this issue Dec 10, 2021 · 42 comments · Fixed by #108855
Labels
  • kind/bug: Categorizes issue or PR as related to a bug.
  • priority/important-longterm: Important over the long term, but may not be staffed and/or may need multiple releases to complete.
  • sig/node: Categorizes an issue or PR as relevant to SIG Node.
  • triage/accepted: Indicates an issue or PR is ready to be actively worked on.

Comments

@rpthms

rpthms commented Dec 10, 2021

What happened?

I have a cronjob that runs a Python script every 5 minutes. After the cronjob finishes running, I start seeing a bunch of log messages in my systemd journal like the following:

Dec 10 21:13:26 gb-7983 kubelet[1386178]: E1210 21:13:26.048459 1386178 cadvisor_stats_provider.go:147] "Unable to fetch pod log stats" err="open /var/log/pods/k8s-homelab_transmission-manager-27319165--1-n2xq8_0a0270b6-d650-453d-96db-2fbb7c43ba42: no such file or directory" pod="k8s-homelab/transmission-manager-27319165--1-n2xq8"
Dec 10 21:13:26 gb-7983 kubelet[1386178]: E1210 21:13:26.049381 1386178 cadvisor_stats_provider.go:147] "Unable to fetch pod log stats" err="open /var/log/pods/k8s-homelab_transmission-manager-27319175--1-frnvv_4c1ba4d5-1098-4644-ad4e-15eaa6e76767: no such file or directory" pod="k8s-homelab/transmission-manager-27319175--1-frnvv"
Dec 10 21:13:26 gb-7983 kubelet[1386178]: E1210 21:13:26.050750 1386178 cadvisor_stats_provider.go:147] "Unable to fetch pod log stats" err="open /var/log/pods/k8s-homelab_transmission-manager-27319170--1-k7dwv_7a94a981-25e0-4079-a60f-9ed00bb859bb: no such file or directory" pod="k8s-homelab/transmission-manager-27319170--1-k7dwv"

Those directories under /var/log/pods don't exist anymore, as they belong to old pods that have finished running.

These messages keep increasing in volume as time passes (I received almost 1,000 of them at one point, after which I had to reboot the node to fix the issue). It looks like the kubelet is not cleaning up the old pods for some reason. I can't see these old pods when I run kubectl get pods -A, so I'm assuming there is some issue with the kubelet's cleanup process.
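A rough way to gauge the volume on a node is to count occurrences in the systemd journal; this assumes the kubelet runs as the kubelet.service unit:

$ journalctl -u kubelet --since "30 min ago" | grep -c "Unable to fetch pod log stats"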

What did you expect to happen?

The kubelet (or cAdvisor) shouldn't try to read logs for pods that no longer exist.

How can we reproduce it (as minimally and precisely as possible)?

Set up a short-lived CronJob that runs every 5 minutes and observe the kubelet logs (Kubernetes v1.22.4, CRI-O v1.22.1, Ubuntu 20.04).
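A minimal reproduction along those lines, with an arbitrary job name, image, and command (any short-lived workload should do):

$ kubectl create cronjob log-stats-repro --image=busybox --schedule="*/5 * * * *" -- /bin/sh -c 'echo done'
$ # after a run or two has completed, watch the kubelet journal on the node that ran the job
$ journalctl -u kubelet -f | grep "Unable to fetch pod log stats"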

Anything else we need to know?

No response

Kubernetes version

$ kubectl version
Client Version: version.Info{Major:"1", Minor:"22", GitVersion:"v1.22.4", GitCommit:"b695d79d4f967c403a96986f1750a35eb75e75f1", GitTreeState:"clean", BuildDate:"2021-11-17T15:48:33Z", GoVersion:"go1.16.10", Compiler:"gc", Platform:"linux/amd64"}

Cloud provider

Baremetal cluster

OS version

$ cat /etc/os-release
NAME="Ubuntu"
VERSION="20.04.3 LTS (Focal Fossa)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 20.04.3 LTS"
VERSION_ID="20.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=focal
UBUNTU_CODENAME=focal

$ uname -a
Linux gb-7983 5.11.0-41-generic #45~20.04.1-Ubuntu SMP Wed Nov 10 10:20:10 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

Install tools

kubeadm (baremetal cluster)

Container runtime (CRI) and version (if applicable)

CRI-O v1.22.1

Related plugins (CNI, CSI, ...) and versions (if applicable)

@rpthms added the kind/bug label Dec 10, 2021
@k8s-ci-robot added the needs-sig and needs-triage labels Dec 10, 2021
@rpthms
Author

rpthms commented Dec 10, 2021

/sig node

@k8s-ci-robot added the sig/node label and removed the needs-sig label Dec 10, 2021
@ehashman added this to Triage in SIG Node Bugs Dec 10, 2021
@pacoxu
Member

pacoxu commented Dec 13, 2021

podToStats := map[statsapi.PodReference]*statsapi.PodStats{}
for key, cinfo := range filteredInfos {
    // On systemd using devicemapper each mount into the container has an
    // associated cgroup. We ignore them to ensure we do not get duplicate
    // entries in our summary. For details on .mount units:
    // http://man7.org/linux/man-pages/man5/systemd.mount.5.html
    if strings.HasSuffix(key, ".mount") {
        continue
    }
    // Build the Pod key if this container is managed by a Pod
    if !isPodManagedContainer(&cinfo) {
        continue
    }
    ref := buildPodRef(cinfo.Spec.Labels)
    // Lookup the PodStats for the pod using the PodRef. If none exists,
    // initialize a new entry.
    podStats, found := podToStats[ref]
    if !found {
        podStats = &statsapi.PodStats{PodRef: ref}
        podToStats[ref] = podStats
    }
    // Update the PodStats entry with the stats from the container by
    // adding it to podStats.Containers.
    containerName := kubetypes.GetContainerName(cinfo.Spec.Labels)
    if containerName == leaky.PodInfraContainerName {
        // Special case for infrastructure container which is hidden from
        // the user and has network stats.
        podStats.Network = cadvisorInfoToNetworkStats(&cinfo)
    } else {
        podStats.Containers = append(podStats.Containers, *cadvisorInfoToContainerStats(containerName, &cinfo, &rootFsInfo, &imageFsInfo))
    }
}
// Add each PodStats to the result.
result := make([]statsapi.PodStats, 0, len(podToStats))
for _, podStats := range podToStats {
    // Lookup the volume stats for each pod.
    podUID := types.UID(podStats.PodRef.UID)
    var ephemeralStats []statsapi.VolumeStats
    if vstats, found := p.resourceAnalyzer.GetPodVolumeStats(podUID); found {
        ephemeralStats = make([]statsapi.VolumeStats, len(vstats.EphemeralVolumes))
        copy(ephemeralStats, vstats.EphemeralVolumes)
        podStats.VolumeStats = append(append([]statsapi.VolumeStats{}, vstats.EphemeralVolumes...), vstats.PersistentVolumes...)
    }
    logStats, err := p.hostStatsProvider.getPodLogStats(podStats.PodRef.Namespace, podStats.PodRef.Name, podUID, &rootFsInfo)
    if err != nil {
        klog.ErrorS(err, "Unable to fetch pod log stats", "pod", klog.KRef(podStats.PodRef.Namespace, podStats.PodRef.Name))
}

@SergeyKanzhelev
Member

/assign @bobbypage

for triage

@Jeansen

Jeansen commented Dec 30, 2021

I have something similar on a fresh cluster I use for educational purposes, using Kubernetes 1.22.5 with CRI-O. Here's an excerpt:

Dec 28 19:14:31 bullseye kubelet[26386]: E1228 20:14:31.603845   26386 cadvisor_stats_provider.go:147] "Unable to fetch pod log stats" err="open /var/log/pods/kube-system_coredns-78fcd69978-fbnxf_001de348-1461-4890-8b87-5cac786d0d91: no such file or directory" pod="kube-system/coredns-78fcd69978-fbnxf"
Dec 28 19:14:31 bullseye kubelet[26386]: E1228 20:14:31.603884   26386 cadvisor_stats_provider.go:147] "Unable to fetch pod log stats" err="open /var/log/pods/kube-system_coredns-78fcd69978-zsrx5_93903455-a9a8-4aa8-8efc-93fb3bffc5d3: no such file or directory" pod="kube-system/coredns-78fcd69978-zsrx5"
Dec 28 19:14:41 bullseye kubelet[26386]: E1228 20:14:41.611317   26386 cadvisor_stats_provider.go:147] "Unable to fetch pod log stats" err="open /var/log/pods/kube-system_coredns-78fcd69978-zsrx5_93903455-a9a8-4aa8-8efc-93fb3bffc5d3: no such file or directory" pod="kube-system/coredns-78fcd69978-zsrx5"
Dec 28 19:14:41 bullseye kubelet[26386]: E1228 20:14:41.611538   26386 cadvisor_stats_provider.go:147] "Unable to fetch pod log stats" err="open /var/log/pods/kube-system_coredns-78fcd69978-fbnxf_001de348-1461-4890-8b87-5cac786d0d91: no such file or directory" pod="kube-system/coredns-78fcd69978-fbnxf"
Dec 28 19:14:51 bullseye kubelet[26386]: E1228 20:14:51.626880   26386 cadvisor_stats_provider.go:147] "Unable to fetch pod log stats" err="open /var/log/pods/kube-system_coredns-78fcd69978-zsrx5_93903455-a9a8-4aa8-8efc-93fb3bffc5d3: no such file or directory" pod="kube-system/coredns-78fcd69978-zsrx5"
Dec 28 19:14:51 bullseye kubelet[26386]: E1228 20:14:51.627030   26386 cadvisor_stats_provider.go:147] "Unable to fetch pod log stats" err="open /var/log/pods/kube-system_coredns-78fcd69978-fbnxf_001de348-1461-4890-8b87-5cac786d0d91: no such file or directory" pod="kube-system/coredns-78fcd69978-fbnxf"
Dec 28 19:15:01 bullseye kubelet[26386]: E1228 20:15:01.642047   26386 cadvisor_stats_provider.go:147] "Unable to fetch pod log stats" err="open /var/log/pods/kube-system_coredns-78fcd69978-fbnxf_001de348-1461-4890-8b87-5cac786d0d91: no such file or directory" pod="kube-system/coredns-78fcd69978-fbnxf"
Dec 28 19:15:01 bullseye kubelet[26386]: E1228 20:15:01.643417   26386 cadvisor_stats_provider.go:147] "Unable to fetch pod log stats" err="open /var/log/pods/kube-system_coredns-78fcd69978-zsrx5_93903455-a9a8-4aa8-8efc-93fb3bffc5d3: no such file or directory" pod="kube-system/coredns-78fcd69978-zsrx5"
Dec 28 19:15:11 bullseye kubelet[26386]: E1228 20:15:11.661392   26386 cadvisor_stats_provider.go:147] "Unable to fetch pod log stats" err="open /var/log/pods/kube-system_coredns-78fcd69978-fbnxf_001de348-1461-4890-8b87-5cac786d0d91: no such file or directory" pod="kube-system/coredns-78fcd69978-fbnxf"
Dec 28 19:15:11 bullseye kubelet[26386]: E1228 20:15:11.661681   26386 cadvisor_stats_provider.go:147] "Unable to fetch pod log stats" err="open /var/log/pods/kube-system_coredns-78fcd69978-zsrx5_93903455-a9a8-4aa8-8efc-93fb3bffc5d3: no such file or directory" pod="kube-system/coredns-78fcd69978-zsrx5"

This is from the master node. Nodes run Debian bullseye in KVM, provisioned with Vagrant. I have not yet tested it with different versions. If I can assist in testing, I'd be happy to do so!

@SergeyKanzhelev
Member

/triage accepted
/priority important-longterm

@k8s-ci-robot added the triage/accepted and priority/important-longterm labels and removed the needs-triage label Jan 5, 2022
@SergeyKanzhelev moved this from Triage to Triaged in SIG Node Bugs Jan 5, 2022
@Neraud

Neraud commented Mar 15, 2022

I've reproduced the same issue on my clusters.

I'm running Debian 11 on bare metal (homelab) and in VM (test).
My clusters are installed using kubeadm, and I'm using multus and calico as CNI.

The issue is triggered as soon as a pod is removed.
I've triggered it by deleting a simple Pod, scaling down a Deployment, or running a CronJob.
On the node where the pod was removed, the logs are then flooded with Unable to fetch pod log stats messages, at a rate of 1 line per removed pod every 10 seconds.

I've seen it both on regular worker nodes and on untainted master nodes.

I've tested under Vagrant with various K8s / CRI-O versions.

With CRI-O 1.21.5, I never triggered the issue with Kubernetes 1.21.10, 1.22.7, or 1.23.4.

However, with CRI-O >= 1.22, all my tests triggered the issue:

  • Kubernetes 1.22.1, CRI-O 1.22.1
  • Kubernetes 1.22.7, CRI-O 1.22.2
  • Kubernetes 1.23.4, CRI-O 1.23.1

Let me know if I can help by running other tests or capturing specific logs.
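To see which removed pods a node is still retrying, and at what rate, the per-pod breakdown can be pulled from the journal; this is just a sketch that relies on the pod="..." field visible in the messages above:

$ journalctl -u kubelet --since "10 min ago" | grep "Unable to fetch pod log stats" | grep -o 'pod="[^"]*"' | sort | uniq -c | sort -rn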

@bobbypage
Member

Is this only an issue with CRI-O? What about containerd? The reports above all seem to be on CRI-O.

/cc @haircommander

@haircommander
Contributor

@Neraud OpenShift seems to have hit this case a couple of times, and digging into some bugs leads me to think it will be fixed with 25cf497 and kube 1.24. Can you try one of the kube alpha releases? CRI-O 1.23.1 should do, or you can try the main branch.

@haircommander
Contributor

It's possible it's unrelated, but https://bugzilla.redhat.com/show_bug.cgi?id=2040399 is why I'm making this guess.

@Neraud

Neraud commented Mar 16, 2022

I'm used to installing a cluster using Debian packages, so it took me a few tries to have a working v1.24 cluster.

I followed the "Without a package manager" documentation to install from the binaries.

I used Kubernetes v1.24.0-alpha.3 and CRI-O 1.23.1.

$ kubectl version
Client Version: version.Info{Major:"1", Minor:"24+", GitVersion:"v1.24.0-alpha.3", GitCommit:"30a21e9abdbbeb78d2b7ce59a79e46299ced2742", GitTreeState:"clean", BuildDate:"2022-02-15T20:46:27Z", GoVersion:"go1.17.7", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"24+", GitVersion:"v1.24.0-alpha.3", GitCommit:"30a21e9abdbbeb78d2b7ce59a79e46299ced2742", GitTreeState:"clean", BuildDate:"2022-02-15T20:39:18Z", GoVersion:"go1.17.7", Compiler:"gc", Platform:"linux/amd64"}

# crictl version
Version:  0.1.0
RuntimeName:  cri-o
RuntimeVersion:  1.23.1
RuntimeApiVersion:  v1alpha2

I'm not certain everything is working fine, but I can schedule a Pod / Deployment / Job, so I assume it's good enough to test this issue.
And I still reproduce the issue as soon as I remove a Pod.

@ooraini

ooraini commented Mar 21, 2022

I'm also hitting the issue in my lab environment (Vagrant), running Rocky Linux 8.5, Kubernetes 1.22.6, and CRI-O 1.22.2.

@haircommander
Contributor

Ah hah! I found a fix, #108855 (and found out why it's only being hit in CRI-O and not cadvisor).

@DreamyProtect

We're also having this issue on a cluster hosting GitLab CI jobs; it's quite a problem since many jobs are spawned and removed daily.

Kubernetes version: v1.22.7
CRI-O version: 1.22.2

Is there any news about the fix? I looked into @haircommander's PR; it seemed OK to me, but I'm not really experienced in Go... Nevertheless, I'd be happy to help.

@pabclsn

pabclsn commented Apr 19, 2022

up

@visimo-jarod

@bobbypage I can confirm that on a bare-metal cluster running containerd I am also seeing these log messages.

@aneagoe

aneagoe commented May 5, 2022

Observing similar issues on 4.10.0-0.okd-2022-04-23-131357 (k8s and crio v1.22.1) and 4.9.0-0.okd-2022-02-12-140851 (k8s v1.23.5+9ce5071 and crio 1.23.2).
The strange thing is that on one cluster only some nodes are affected, while on the other all of them are. Below is an overview of occurrences of Failed to update stats for container over 30 minutes.
Cluster running version 4.10.0-0.okd-2022-04-23-131357:

worker-01: 0
worker-02: 39
worker-03: 0
worker-04: 3
worker-05: 0
worker-06: 0
worker-07: 0
worker-08: 0
worker-09: 3
worker-10: 2
worker-11: 0
worker-12: 13519

Cluster running version 4.10.0-0.okd-2022-04-23-131357:

worker-01: 13419
worker-02: 24689

No idea where the discrepancy is coming from.

@DreamyProtect

DreamyProtect commented May 11, 2022

Tested with Kubernetes v1.24.0 and CRI-O v1.23.2 on a fresh cluster bootstrapped with kubeadm; all machines run Debian 11. I ran a job 3 times and there's a log line for every execution, so it spams 3 log lines per kubelet loop.

@haircommander
Contributor

@aneagoe To verify my suspicion, I would be interested in seeing crictl inspectp output for all the pods that are showing up in your script, as well as crictl ps -a to cross-reference pods to containers, if you wouldn't mind uploading that somewhere.

@aneagoe

aneagoe commented May 19, 2022

@haircommander I'm sure they're not in crio anymore. I'm only using crictl inspectp to ensure I'm never going to touch a scope that's reported under a pod still seen by crio. I'll scale down the DaemonSet on my test cluster, run some workload to generate the leaking pods, and then provide you with some log snippets and the requested output from crictl.

@DreamyProtect

I took one of my clusters that's suffering heavily from this issue (hosting GitLab CI jobs) and reactivated the logs on it for a second; it doesn't take long for the kubelet to log things like this:

E0519 10:17:40.948954  196226 manager.go:1127] Failed to create existing container: /pids/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod4f3c87e5_9c3d_4a72_b73e_f7a406d65220.slice/crio-67cc87ece833ceaf38675db34921a54ebc3d683d3b308b82371e15b922e2a90a.scope: Error finding container 67cc87ece833ceaf38675db34921a54ebc3d683d3b308b82371e15b922e2a90a: Status 404 returned error &{%!s(*http.body=&{0xc0013ccd98 <nil> <nil> false false {0 0} false false false <nil>}) {%!s(int32=0) %!s(uint32=0)} %!s(bool=false) <nil> %!s(func(error) error=0x87a160) %!s(func() error=0x87a0e0)}

Trying to inspect the pod with crictl inspectp 4f3c87e5_9c3d_4a72_b73e_f7a406d65220 returns:

FATA[0000] getting the pod sandbox status for "4f3c87e5_9c3d_4a72_b73e_f7a406d65220": rpc error: code = NotFound desc = could not find pod "4f3c87e5_9c3d_4a72_b73e_f7a406d65220": PodSandbox with ID starting with 4f3c87e5_9c3d_4a72_b73e_f7a406d65220 not found: ID does not exist

I tested with a few pods and the result is always the same.
This seems logical, as crictl ps -a gives me the following:

CONTAINER           IMAGE                                                              CREATED             STATE               NAME                ATTEMPT             POD ID
7bd5e0c033deb       7f92d556d4ffebe2ba6d02872c2faa9a35ece816f6ca8ccfe5104dcb0a3fc063   2 days ago          Running             weave-npc           8                   7c6a0edc1fb37
d2a940d48ace1       df29c0a4002c047fe35dab1cba959a4bed6f034ab9b95b14280ea7bb158cc111   2 days ago          Running             weave               8                   7c6a0edc1fb37
f8dbcaad4a029       df29c0a4002c047fe35dab1cba959a4bed6f034ab9b95b14280ea7bb158cc111   2 days ago          Exited              weave-init          8                   7c6a0edc1fb37
f20329512cd6d       2025ac44bb2b384b9f3ecc2b75417ff17d6eedc50d1c41eb817cfc40636d0cc3   2 days ago          Running             kube-proxy          8                   85f567ce595c0

But the slices are still there, though: find /sys/fs/cgroup -type d | grep 4f3c87e5_9c3d_4a72_b73e_f7a406d65220

/sys/fs/cgroup/pids/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod4f3c87e5_9c3d_4a72_b73e_f7a406d65220.slice
/sys/fs/cgroup/pids/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod4f3c87e5_9c3d_4a72_b73e_f7a406d65220.slice/crio-67cc87ece833ceaf38675db34921a54ebc3d683d3b308b82371e15b922e2a90a.scope

So in my opinion crictl returns the right things; it's either the kubelet not correctly updating the list of pods to watch, or a systemd problem with the slices.
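To enumerate every leftover pod slice at once, rather than grepping for a single UID, something like the following should work; it assumes the systemd cgroup driver (pod UIDs appear with dashes replaced by underscores) and a placeholder node name, and on cgroup v1 the same slice shows up once per controller:

$ find /sys/fs/cgroup -type d -name 'kubepods-*pod*.slice' | sort
$ # pod UIDs the API server still knows about on this node, rewritten to the cgroup naming
$ kubectl get pods -A --field-selector spec.nodeName=<node> -o jsonpath='{range .items[*]}{.metadata.uid}{"\n"}{end}' | tr '-' '_'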

@aneagoe

aneagoe commented May 19, 2022

@haircommander I've attached the logs and requested outputs here (sanitized hostnames). Let me know if you'd need anything else; for now I'll leave them around and not scale up the DS on the test cluster.
k8s_106957.tar.gz

@aneagoe

aneagoe commented May 20, 2022

I've posted a quick gist with all the manifests I'm using to deploy a garbage collector daemon set. Nothing fancy, just a bit of patchwork to address the issues I was having.
https://gist.github.com/aneagoe/6e18aaff48333ec059d0c1283b06813f

@aneagoe

aneagoe commented May 23, 2022

Here's another really interesting bit of information. In my case, this issue seems to be only affecting VMs and not bare-metal nodes. It's completely puzzling, but I have 12 bare-metal nodes and 1 VM node in one cluster (the VM is just for testing) and it manifests only on the VM node. As well, my test cluster is made up fully of VMs (i.e. no bare metal) and there it happens across the board. The bare-metal nodes where the VMs are running are the same hardware as the bare-metal nodes of my main cluster (just more NVMe drives).

Edit: amended to clarify this is observed in my case running OKD.

@fkocik

fkocik commented May 23, 2022

Sorry, but I have the same issue on bare-metal nodes (x86 Arch Linux, kernel 5.17.5, with K8s 1.23.6 / CRI-O 1.23.2).

@rkojedzinszky

Here's another really interesting bit of information. This issue seems to be only affecting VMs and not bare-metal nodes. It's completely puzzling, but I have 12 bare-metal nodes and 1 VM node in one cluster (the VM is just for testing) and it manifests only on the VM node. As well, my test cluster is made up fully of VMs (i.e. no bare metal) and there it happens across the board. The bare-metal nodes where the VMs are running are the same hardware as the bare-metal nodes of my main cluster (just more NVMe drives).

The issue in the title happens on bare-metal nodes too.

@ryuseongryong

I've got the same issue on bare-metal nodes (k8s v1.23.0 / CRI-O v1.22.3). I'm not sure if it's for the same reason, but I have this problem as well:

May 23 02:07:19 node3 kubelet[1629]: W0523 02:07:19.310875    1629 container.go:590] Failed to update stats for container "/pids/kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-podf9db5cc5_3578_421f_aeac_4267e7f91bfa.slice/crio-3afad8de4220d0906aa0ae0df23acd9389aa2bbef3e26702d91a5a1adec9b813.scope": unable to determine device info for dir: /var/lib/containers/storage/overlay/e3c6e2330f35d040a4aec5dc84d291933d157d2c610d03e0e0db9be66b335a5d/diff: stat failed on /var/lib/containers/storage/overlay/e3c6e2330f35d040a4aec5dc84d291933d157d2c610d03e0e0db9be66b335a5d/diff with error: no such file or directory, continuing to push stats
May 23 02:11:09 node3 kubelet[1629]: E0523 02:11:09.730227    1629 cadvisor_stats_provider.go:146] "Unable to fetch pod log stats" err="open /var/log/pods/airflow_cyclehistorycaculatecyclehistory.003e45820f4e4dedbfbbaa12f9586272_1661ed0a-2550-49f9-8d07-364408aac6dc: no such file or directory" pod="airflow/cyclehistorycaculatecyclehistory.003e45820f4e4dedbfbbaa12f9586272"

@aneagoe

aneagoe commented May 24, 2022

Here's another really interesting bit of information. In my case, this issue seems to be only affecting VMs and not bare-metal nodes. It's completely puzzling, but I have 12 bare-metal nodes and 1 VM node in one cluster (the VM is just for testing) and it manifests only on the VM node. As well, my test cluster is made up fully of VMs (i.e. no bare metal) and there it happens across the board. The bare-metal nodes where the VMs are running are the same hardware as the bare-metal nodes of my main cluster (just more NVMe drives).

Edit: amended to clarify this is observed in my case running OKD.

I figured out the reason for the discrepancy. The VM node was using cgroups v2, while the bare-metal, older nodes had the kernel parameter systemd.unified_cgroup_hierarchy=0. Removing that and rebooting triggers the problem.
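For anyone unsure which cgroup version a node is actually on, the filesystem type of /sys/fs/cgroup gives it away: cgroup2fs means the unified v2 hierarchy, tmpfs means v1/hybrid.

$ stat -fc %T /sys/fs/cgroup/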

@rkojedzinszky

I can confirm this too; it seems to happen with cgroups v2.

@rpthms
Author

rpthms commented Jun 4, 2022

Did some testing and I can confirm that switching from CRI-O to containerd has stopped these log messages from popping up all the time. For now, this seems like a reasonable solution, at least for me.

Using Cgroups v2
Kubernetes Version: v1.23.7
OS: Ubuntu 22.04
Containerd Version: v1.6.4
Runc Version: 1.1.1

EDIT: Kubernetes Version was v1.23, not v1.24

@haircommander
Contributor

Another option is to switch to the CRI stats provider by setting the kubelet's container_runtime_endpoint to unix:///run/crio/crio.sock instead of unix:///var/run/crio/crio.sock.
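On a kubeadm-provisioned node, one way to apply that workaround is to rewrite the endpoint in the kubelet flags file and restart the kubelet; the file path below is the kubeadm default and may differ on other setups, so adjust to wherever --container-runtime-endpoint is actually set:

$ sudo sed -i 's|unix:///var/run/crio/crio.sock|unix:///run/crio/crio.sock|' /var/lib/kubelet/kubeadm-flags.env
$ sudo systemctl restart kubelet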

@aneagoe

aneagoe commented Jun 27, 2022

Another option is to switch to the CRI stats provider by setting the kubelet's container_runtime_endpoint to unix:///run/crio/crio.sock instead of unix:///var/run/crio/crio.sock.

Is this also an option for OKD/OCP? Is there any progress on a fix or should we look at the above workaround instead?

@SISheogorath

Out of curiosity, why does this work?

Another option is to switch to the CRI stats provider by setting the kubelet's container_runtime_endpoint to unix:///run/crio/crio.sock instead of unix:///var/run/crio/crio.sock.

/cc @haircommander

@haircommander
Contributor

Is this also an option for OKD/OCP? Is there any progress on a fix or should we look at the above workaround instead?

This would work for OKD, but wouldn't be supported for OCP OOTB. #108855 should fix it, but it needs a reviewer/approver...

Out of curiosity, why does this work?

Fun Kubernetes fact! The CRI-O team found performance issues with the CRI stats provider, so there's a hardcoded check in the kubelet to see if the socket is /var/run/crio/crio.sock. However, since /var/run is symlinked to /run and the check is a string comparison, one can use /run/crio/crio.sock and not trigger that toggle, turning on the CRI stats provider. We're working on fixing the perf troubles with CRI stats in KEP 2371.
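The symlink half of that explanation is easy to verify on a node; on most modern distros both spellings resolve to the same socket, and only the /var/run one matches the hardcoded string:

$ ls -ld /var/run
$ readlink -f /var/run/crio/crio.sock
$ readlink -f /run/crio/crio.sock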

@ish-xyz

ish-xyz commented Aug 10, 2022

Using unix:///run/crio/crio.sock doesn't solve the problem in my environment; we're still seeing the same error logs.
We're running:
Flatcar 3033.2.4
Kubelet 1.22.4
CRI-O 1.22.5
cgroup v2 enabled.

W0810 11:05:50.707417 3143935 container.go:586] Failed to update stats for container "/pids/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod6c2c7965_7e29_4ed4_8e7f_b66cfaca4c22.slice/crio-14c3ee4d1d08de5a88a28dc24c9189977cf870c03551e1502bccf9d07fc46f45.scope": unable to determine device info for dir: /var/lib/containers/storage/overlay/f5aa147151a03909ac056604b64cfa7776874443a855d9c4eeb6e4c70cd10e39/diff: stat failed on /var/lib/containers/storage/overlay/f5aa147151a03909ac056604b64cfa7776874443a855d9c4eeb6e4c70cd10e39/diff with error: no such file or directory, continuing to push stats

nikhiljha added a commit to ocf/puppet that referenced this issue Aug 15, 2022
As explained in kubernetes/kubernetes#106957 (comment) -- the CRI-O team found performance issues in the CRI stats provider, so there's a hack in kubelet which falls back to cadvisor when crio is being used.

This is currently broken and causing a bunch of spam in our logs that looks like  "Unable to fetch pod log stats" -- leading to some nodes (like jaws, with a 64G disk) to become NotReady due to disk pressure.

This commit works around the above issue by subverting the string check using the fact that `/run` is symlinked to `/var/run`.
nikhiljha added a commit to nikhiljha/kubernetes that referenced this issue Aug 15, 2022
This workaround was implemented due to performance issues, but has since
caused a new issue kubernetes#106957
that users are working around by switching away from CRI-O, or renaming
their CRI socket from /var/run to /run to fool the string comparison we
use to detect CRI-O. Since it appears CRI-O now works fine without this
hack, I think we can just remove it to alleviate some headache.

I'm not 100% sure if this is no longer needed, but that's why I'm opening
the PR. If this looks good to others, I can rip out the rest of the legacy
implementation as well.
@ratzalex

I had this error on Kubernetes version 1.24, with CRI-O 1.24. Today I did a fresh install with Kubernetes 1.25 + CRI-O 1.25 (with the same machine types and OS versions) and the error went away. I tried deployment scale-down and pod deletion, but I no longer see the mentioned "Unable to fetch pod log stats" messages in the log.

  • VirtualBox machines
  • OS: RHEL 9 with kernel 5.14.0-70.22.1.el9_0.x86_64
  • Kubernetes version: 1.25.0
  • CRI-O version: 1.25.0
  • Runc version 1.1.3
  • Using cgroups v2

@camaeel

camaeel commented Jan 24, 2023

We are running Kubernetes 1.24.9 with CRI-O 1.24.4 and runc 1.1.4 on Ubuntu 22.04 with cgroups v2 and experienced the same issue. For some time I was running a deployment with a dozen pods and rotated them every 10 seconds to create a lot of garbage in cgroups.
This caused a rather slow increase in both memory & CPU usage and a lot of error messages like:

Jan 24 13:27:36 ip-10-227-245-136 kubelet[782032]: W0124 13:27:36.925301  782032 container.go:589] Failed to update stats for container "/pids/kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-pod09f8793d_db6c_4f54_a49d_fe583bf89c7a.slice/crio-fb8fec12d6dd5917df6aecc015bbba81eaedc365a069851fd656b92203426bad.scope": unable to determine device info for dir: /var/lib/containers/storage/overlay/e5ece8da121ed7166789472f80b3bc6627909ebdc897f992062d7f2d86244093/diff: stat failed on /var/lib/containers/storage/overlay/e5ece8da121ed7166789472f80b3bc6627909ebdc897f992062d7f2d86244093/diff with error: no such file or directory, continuing to push stats

Kubelet memory & CPU usage: [screenshots omitted]

After I stopped generating those pods, the resource usage stopped growing.

After running @aneagoe's script from #106957 (comment), the CPU & memory usage dropped.
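For anyone wanting to reproduce that kind of churn, a crude generator along these lines is enough (the deployment name and image are arbitrary; it assumes a recent kubectl where create deployment accepts --replicas and a command). Each restart rolls all the pods, so pods get removed every 10 seconds:

$ kubectl create deployment churn --replicas=12 --image=busybox -- sleep 3600
$ while true; do kubectl rollout restart deployment/churn; sleep 10; done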
