Kubelet spamming 'Unable to fetch pod log stats' log messages after running cronjobs #106957

Comments

/sig node

---
kubernetes/pkg/kubelet/stats/cadvisor_stats_provider.go, lines 97 to 147 in d5f39eb
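The referenced lines are in the kubelet's cadvisor-based stats provider, which is where the message in this issue's title is reported to come from: stats are collected per pod, including the size of the pod's log directory, and when that directory has already been removed the lookup fails and the error is logged. The snippet below is only a simplified, hypothetical sketch of that flow; the names (podLogStats, collectPodLogStats) and the hardcoded directory list are illustrative and are not the actual kubelet code.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// podLogStats is a stand-in for the log-usage numbers the real stats
// provider reports per pod (bytes used, inodes, and so on).
type podLogStats struct {
	UsedBytes int64
}

// collectPodLogStats walks a pod's log directory and sums file sizes. If the
// directory is already gone (for example the pod finished and was cleaned
// up), the walk fails and the caller logs the error.
func collectPodLogStats(podLogDir string) (*podLogStats, error) {
	stats := &podLogStats{}
	err := filepath.Walk(podLogDir, func(_ string, info os.FileInfo, err error) error {
		if err != nil {
			return err
		}
		if !info.IsDir() {
			stats.UsedBytes += info.Size()
		}
		return nil
	})
	if err != nil {
		return nil, err
	}
	return stats, nil
}

func main() {
	// Hypothetical list of pods the stats provider still thinks it should
	// report on; in the bug described here, some of these pods have already
	// been deleted and their log directories removed.
	podLogDirs := []string{
		"/var/log/pods/default_cronjob-12345_aaaa-bbbb",
		"/var/log/pods/default_cronjob-67890_cccc-dddd",
	}
	for _, dir := range podLogDirs {
		if _, err := collectPodLogStats(dir); err != nil {
			// The real kubelet logs this condition via klog as
			// "Unable to fetch pod log stats"; printing here for the sketch.
			fmt.Printf("Unable to fetch pod log stats: %v\n", err)
		}
	}
}
```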
---

/assign @bobbypage for triage

---
I have something similar on a fresh cluster I use for educational purposes, running Kubernetes 1.22.5 with CRI-O. Here's an excerpt; this is from the master node. The nodes run Debian bullseye in KVM, provisioned with Vagrant. I have not yet tested with different versions. If I can assist in testing, I'd be happy to do so!

---
/triage accepted

---
I've reproduced the same issue on my clusters. I'm running Debian 11 on bare metal (homelab) and in VMs (test). The issue is triggered as soon as a pod is removed. I've seen it both on regular worker nodes and on untainted master nodes. I've tested under Vagrant with various K8s / CRI-O versions, and whether the issue appears depends on the CRI-O version. Let me know if I can help by running other tests or capturing specific logs.

---
Is this only an issue with CRI-O? What about containerd? The reports above all seem to be on CRI-O. /cc @haircommander

---
It's possible it's unrelated, but https://bugzilla.redhat.com/show_bug.cgi?id=2040399 is why I am making this guess.

---
I'm used to installing a cluster using Debian packages, so it took me a few tries to get a working v1.24 cluster. I followed the "Without a package manager" documentation to install from the binaries. I used Kubernetes v1.24.0-alpha.3 with CRI-O 1.23.1:

    $ kubectl version
    Client Version: version.Info{Major:"1", Minor:"24+", GitVersion:"v1.24.0-alpha.3", GitCommit:"30a21e9abdbbeb78d2b7ce59a79e46299ced2742", GitTreeState:"clean", BuildDate:"2022-02-15T20:46:27Z", GoVersion:"go1.17.7", Compiler:"gc", Platform:"linux/amd64"}
    Server Version: version.Info{Major:"1", Minor:"24+", GitVersion:"v1.24.0-alpha.3", GitCommit:"30a21e9abdbbeb78d2b7ce59a79e46299ced2742", GitTreeState:"clean", BuildDate:"2022-02-15T20:39:18Z", GoVersion:"go1.17.7", Compiler:"gc", Platform:"linux/amd64"}

    # crictl version
    Version: 0.1.0
    RuntimeName: cri-o
    RuntimeVersion: 1.23.1
    RuntimeApiVersion: v1alpha2

I'm not certain everything is working fine, but I can schedule a Pod / Deployment / Job, so I assume it's good enough to test this issue.

---
I'm also hitting the issue in my lab environment (Vagrant), running Rocky Linux 8.5, Kubernetes 1.22.6, and CRI-O 1.22.2.

---
Ah hah! I found a fix: #108855 (and found out why it's only being hit with CRI-O and not with containerd).

---
We're also having this issue on a cluster hosting GitLab CI jobs; it's quite troublesome since many jobs are spawned and removed daily. Kubernetes version: v1.22.7. Is there any news about the fix? I looked into @haircommander's PR and it seemed OK to me, but I'm not really experienced in Go... Nevertheless, I'd be happy to help.

---
up

---
@bobbypage I can confirm that I am also seeing these log messages on a bare-metal cluster running containerd.

---
Observing similar issues on my cluster. No idea where the discrepancy is coming from.

---
Tested with

---
@aneagoe To verify my suspicion, I would be interested in seeing some additional output from your node.

---
@haircommander I'm sure they're not in crio anymore.

---
I took one of my clusters that's suffering heavily from this issue (hosting GitLab CI jobs) and re-enabled logging on it for a moment; it doesn't take long for the kubelet to log messages like the ones described above. Trying to inspect the affected pods with crictl, I tested a few of them and the result is always the same: the pod is gone from the runtime, but the systemd slices are still there. So in my opinion crictl returns the right thing; it's either the kubelet not correctly updating the list of pods to watch, or a systemd problem with the leftover slices.

---
@haircommander I've attached the logs and requested outputs here (hostnames sanitized). Let me know if you need anything else; for now I'll leave them around and not scale up the DaemonSet on the test cluster.

---
I've posted a quick gist with all the manifests I'm using to deploy a garbage-collector DaemonSet. Nothing fancy, just a bit of patchwork to address the issues I was having.

---
Here's another really interesting bit of information. In my case, this issue seems to affect only VMs and not bare-metal nodes. It's completely puzzling, but I have 12 bare-metal nodes and 1 VM node in one cluster (the VM is just for testing) and it manifests only on the VM node. Also, my test cluster is made up entirely of VMs (i.e. no bare metal), and there it happens across the board. The bare-metal hosts the VMs run on are the same hardware as the bare-metal nodes of my main cluster (just with more NVMe drives). Edit: amended to clarify that in my case this is observed running OKD.

---
Sorry, but I have the same issue on bare-metal nodes (x86 Arch Linux, kernel 5.17.5, with K8s 1.23.6 / CRI-O 1.23.2). The issue in the title happens on bare-metal nodes too.

---
I've got the same issue on bare-metal nodes (K8s v1.23.0 / CRI-O v1.22.3). I'm not sure if it's for the same reason, but I have this problem as well.

---

I figured out the reason for the discrepancy. The VM node was using cgroups v2, while the bare-metal (older) nodes had a kernel parameter keeping them on the legacy cgroups v1 hierarchy.

---
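As an aside, since the discrepancy turned out to hinge on cgroups v1 versus v2: a quick way to check which hierarchy a node is on is to look for /sys/fs/cgroup/cgroup.controllers, which only exists on the unified (v2) hierarchy. This is a generic heuristic added here for convenience, not something posted in the thread.

```go
package main

import (
	"fmt"
	"os"
)

func main() {
	// On a cgroups v2 (unified) host the root cgroup exposes
	// cgroup.controllers; on a v1/hybrid host this file does not exist.
	_, err := os.Stat("/sys/fs/cgroup/cgroup.controllers")
	switch {
	case err == nil:
		fmt.Println("this node is using cgroups v2 (unified hierarchy)")
	case os.IsNotExist(err):
		fmt.Println("this node is using cgroups v1 (legacy or hybrid hierarchy)")
	default:
		fmt.Printf("could not determine cgroup version: %v\n", err)
	}
}
```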
I can confirm this too; it seems to happen with cgroups v2.

---
Did some testing, and I can confirm that switching from CRI-O to containerd has stopped these log messages from popping up all the time. For now this seems like a reasonable solution, at least for me. I'm using cgroups v2. EDIT: the Kubernetes version was v1.23, not v1.24.

---
Another option: one could switch to the CRI stats provider by setting kubelet's container_runtime_endpoint to unix:///run/crio/crio.sock, which bypasses the hardcoded CRI-O socket check described below.

---

Is this also an option for OKD/OCP? Is there any progress on a fix, or should we look at the above workaround instead?

---
Out of curiosity, why does this work? /cc @haircommander

---

This would work for OKD, but wouldn't be supported for OCP out of the box. #108855 should fix this, but it needs a reviewer/approver...
Fun Kubernetes fact! The CRI-O team found performance issues with the CRI stats provider, so there's a hardcoded check to see if the socket is the CRI-O one under /var/run; if it is, the kubelet falls back to the cadvisor stats provider.
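For illustration, here is a minimal sketch of the kind of hardcoded socket check described above. The constant, function name, and exact comparison (modelled here as an exact path match after stripping the unix:// scheme) are assumptions made for this example rather than the real kubelet source, but it shows why pointing the kubelet at the same socket through /run instead of /var/run changes which stats provider gets picked, which is the workaround discussed in the surrounding comments.

```go
package main

import (
	"fmt"
	"strings"
)

// crioSocketPath is the path the kubelet associates with CRI-O in this sketch.
// /run/crio/crio.sock reaches the same socket (via the /run and /var/run
// symlinking mentioned in this thread) but does not match this string.
const crioSocketPath = "/var/run/crio/crio.sock"

// usingLegacyCadvisorStats sketches the decision described above: if the
// configured runtime endpoint looks like CRI-O's default socket, fall back to
// the cadvisor stats provider instead of the CRI stats provider.
func usingLegacyCadvisorStats(runtimeEndpoint string) bool {
	// Strip a "unix://" scheme prefix, if present, before comparing paths.
	path := strings.TrimPrefix(runtimeEndpoint, "unix://")
	return path == crioSocketPath
}

func main() {
	endpoints := []string{
		"unix:///var/run/crio/crio.sock",         // matches: cadvisor stats provider
		"unix:///run/crio/crio.sock",             // same socket, no match: CRI stats provider
		"unix:///run/containerd/containerd.sock", // containerd: CRI stats provider
	}
	for _, endpoint := range endpoints {
		fmt.Printf("%-42s legacy cadvisor stats: %v\n", endpoint, usingLegacyCadvisorStats(endpoint))
	}
}
```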
---

Using unix:///run/crio/crio.sock doesn't solve the problem in my environment; we're still seeing the same error logs.

---

As explained in kubernetes/kubernetes#106957 (comment) -- the CRI-O team found performance issues in the CRI stats provider, so there's a hack in kubelet which falls back to cadvisor when crio is being used. This is currently broken and causing a bunch of spam in our logs that looks like "Unable to fetch pod log stats" -- leading some nodes (like jaws, with a 64G disk) to become NotReady due to disk pressure. This commit works around the above issue by subverting the string check using the fact that `/run` is symlinked to `/var/run`.

---

This workaround was implemented due to performance issues, but it has since caused a new issue, kubernetes#106957, that users are working around by switching away from CRI-O or renaming their CRI socket from /var/run to /run to fool the string comparison we use to detect CRI-O. Since it appears CRI-O now works fine without this hack, I think we can just remove it to alleviate some headache. I'm not 100% sure it's no longer needed, but that's why I'm opening the PR. If this looks good to others, I can rip out the rest of the legacy implementation as well.

---
I had this error on Kubernetes 1.24 with CRI-O 1.24. Today I did a fresh install with Kubernetes 1.25 + CRI-O 1.25 (with the same machine types and OS versions) and the error went away. I tried deployment scale-down and pod deletion, but I no longer see the mentioned "Unable to fetch pod log stats" messages in the log.

---
We are running Kubernetes 1.24.9 with CRI-O 1.24.4 and runc 1.1.4 on Ubuntu 22.04 with cgroups v2 and experienced the same issue. For some time I was running a deployment with a dozen pods and rotated them every 10 seconds to create a lot of garbage in cgroups. After I stopped generating those pods, the resource usage stopped growing. After running @aneagoe's script from #106957 (comment), the CPU and memory usage dropped.

---
What happened?
I have a CronJob that runs a Python script every 5 minutes. After the CronJob finishes running, I start seeing a bunch of "Unable to fetch pod log stats" messages in my systemd journal. The directories under /var/log that these messages refer to don't exist, as they belong to old pods that have finished running.
These messages keep increasing in volume as time passes (I once received almost 1,000 of them, after which I had to reboot the node to fix the issue). It almost looks like the kubelet is not cleaning up the old pods for some reason. I can't see these old pods when I run kubectl get pods -A, so I'm assuming there is some issue with the kubelet's cleanup process.

What did you expect to happen?
kubelet (or cadvisor) shouldn't try to read logs for pods that don't exist.
How can we reproduce it (as minimally and precisely as possible)?
Set up a short-lived CronJob running every 5 minutes and observe the logs (Kubernetes v1.22.4, CRI-O v1.22.1, Ubuntu 20.04).
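To make the reproduction step concrete, here is a client-go sketch that creates a short-lived CronJob on a */5 schedule; applying an equivalent YAML manifest with kubectl works just as well. The namespace, object name, image, and kubeconfig path are assumptions for the example, not taken from the original report.

```go
package main

import (
	"context"
	"log"
	"path/filepath"

	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
	"k8s.io/client-go/util/homedir"
)

func main() {
	// Load credentials from the default kubeconfig location (an assumption;
	// adjust for your environment).
	kubeconfig := filepath.Join(homedir.HomeDir(), ".kube", "config")
	config, err := clientcmd.BuildConfigFromFlags("", kubeconfig)
	if err != nil {
		log.Fatalf("loading kubeconfig: %v", err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		log.Fatalf("creating clientset: %v", err)
	}

	// A short-lived job every 5 minutes: each run creates a pod that exits
	// quickly and is eventually cleaned up, which is the situation the
	// reporter describes triggering the log spam.
	cronJob := &batchv1.CronJob{
		ObjectMeta: metav1.ObjectMeta{Name: "log-stats-repro"},
		Spec: batchv1.CronJobSpec{
			Schedule: "*/5 * * * *",
			JobTemplate: batchv1.JobTemplateSpec{
				Spec: batchv1.JobSpec{
					Template: corev1.PodTemplateSpec{
						Spec: corev1.PodSpec{
							RestartPolicy: corev1.RestartPolicyNever,
							Containers: []corev1.Container{{
								Name:    "task",
								Image:   "busybox:1.36",
								Command: []string{"sh", "-c", "echo running; sleep 5"},
							}},
						},
					},
				},
			},
		},
	}

	created, err := clientset.BatchV1().CronJobs("default").Create(context.Background(), cronJob, metav1.CreateOptions{})
	if err != nil {
		log.Fatalf("creating CronJob: %v", err)
	}
	log.Printf("created CronJob %s; watch the node's journal for 'Unable to fetch pod log stats' after runs complete", created.Name)
}
```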
Anything else we need to know?
No response
Kubernetes version
Cloud provider
Baremetal cluster
OS version
Install tools
Container runtime (CRI) and version (if applicable)
Related plugins (CNI, CSI, ...) and versions (if applicable)