
k3s log fill up my disk in short time #7128

Open
liyimeng opened this issue Mar 21, 2023 · 5 comments
Labels
kind/bug Something isn't working

Comments

@liyimeng
Contributor

Environmental Info:
K3s Version: 1.25.6

Node(s) CPU architecture, OS, and Version:

x86_64, ubuntu 22.04
Cluster Configuration:

3 servers, 3 nodes
Describe the bug:

I had a freshly installed cluster running for 3-4 days when, suddenly, one of the masters got its disk filled up by k3s-service.log,
which keeps printing

msg="Failed to test temporary data store connection: failed to dial endpoint http://127.0.0.1:2399 with maintenance client: context canceled"

Millions of lines of this text make k3s-service.log grow to hundreds of GB in a couple of hours.

Steps To Reproduce:

  • Installed K3s: install 1.25.6

Expected behavior:

Cluster nodes keep running stably.

Actual behavior:

One of the masters gets filled up with massive log output, which eventually kills the node.

Additional context / logs:

It keeps printing

msg="Failed to test temporary data store connection: failed to dial endpoint http://127.0.0.1:2399 with maintenance client: context canceled"
@brandond
Member

You'll need to provide more than just the one repeating log message. Can you go back in the logs to just before that message started repeating, or perhaps just stop k3s, clean up the logs, and then start it again so that you can get the logs from the beginning of startup onwards?

You might also confirm that nothing else obvious has gone wrong with this host, such as running out of disk space.
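For reference, one way to capture such a startup log on an OpenRC-managed host; a minimal sketch, assuming the k3s-service name and log path that appear later in this thread, so adjust to the actual setup:

rc-service k3s-service stop               # stop the supervised k3s server
truncate -s 0 /var/log/k3s-service.log    # discard the oversized old log
rc-service k3s-service start              # restart and begin a fresh log
tail -f /var/log/k3s-service.log          # follow the log from startup onwards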

@liyimeng
Contributor Author

@brandond Thanks for the attention! Yes, I know the log provides no clue here. By the time I saw the issue, the log file was 400GB+, making it impossible to see the beginning of the log. I restarted the service to collect the logs again, but the problem was gone once I did so, so I lost the chance to collect a meaningful log. Is this because something is going wrong with the embedded etcd?

Btw, my friend said he experienced the same thing on 1.23.10. Rebooting the node made the problem go away.

I will try to collect a meaningful log when it occurs again.

@liyimeng
Contributor Author

liyimeng commented Mar 27, 2023

It is happening again. I observe that more than one k3s server instance is running on the node, even though I have stopped the k3s-service.

ps -ef | grep  server | grep k3s
root     11974     1 99 17:54 ?        00:11:03 /sbin/k3s server
root     15326     1 99 16:15 ?        03:30:54 /sbin/k3s server
root     27884     1 47 16:14 ?        00:50:24 /sbin/k3s server
root     32143     1 99 17:50 ?        00:18:19 /sbin/k3s server

My system uses OpenRC to start the service. On a normal node, I have:

ps -ef | grep  server | grep k3s
root     37587     1  0 13:49 ?        00:00:00 supervise-daemon k3s-service --start --stdout /var/log/k3s-service.log --stderr /var/log/k3s-service.log --pidfile /var/run/k3s-service.pid --respawn-delay 5 --respawn-max 0 /sbin/k3s -- server --disable servicelb --server https://kubernetes --node-external-ip 172.27.13.170 --protect-kernel-defaults=true --secrets-encryption=true --kube-apiserver-arg=audit-policy-file=/var/lib/rancher/k3s/server/audit.yaml --kube-apiserver-arg=audit-log-path=/var/lib/rancher/k3s/server/audit/audit.log --kube-apiserver-arg=audit-log-maxage=30 --kube-apiserver-arg=audit-log-maxbackup=10 --kube-apiserver-arg=audit-log-maxsize=100 --kube-apiserver-arg=request-timeout=300s --kube-apiserver-arg=service-account-lookup=true --kube-apiserver-arg=enable-admission-plugins=NodeRestriction,PodSecurity,NamespaceLifecycle,ServiceAccount --kube-apiserver-arg=feature-gates=MemoryQoS=true,PodSecurity=true --kube-controller-manager-arg=terminated-pod-gc-threshold=10 --kube-controller-manager-arg=use-service-account-credentials=true --kubelet-arg=streaming-connection-idle-timeout=5m --kubelet-arg=make-iptables-util-chains=true --node-label k3os.io/mode=local --node-label k3os.io/version=0404260
root     37588 37587 27 13:49 ?        01:09:30 /sbin/k3s server

For some reason, the k3s-service script does not actually kill the '/sbin/k3s server' processes; the leftover processes conflict with each other and race to write into the log file, accumulating GBs of logs in a couple of minutes.

@brandond Is there any chance we can improve create_openrc_service_file() in install.sh to make it more robust and prevent this situation from happening?
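A minimal sketch of what such a guard could look like in the generated OpenRC init script; the start_pre hook and the pkill pattern are illustrative assumptions, not the actual contents produced by create_openrc_service_file():

#!/sbin/openrc-run
# Illustrative fragment only; the real script generated by install.sh also
# sets command, command_args, supervisor, pidfile, and log redirection.

start_pre() {
    # Make sure no leftover "k3s server" process from a previous run is still
    # alive before supervise-daemon launches a new one, so two instances never
    # race to write the same log file.
    pkill -f '^/sbin/k3s server' >/dev/null 2>&1 || true
}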

liyimeng added a commit to liyimeng/k3s that referenced this issue Mar 27, 2023
Before starting up new k3s instance, make sure old ones are gone. 
To fix issue  k3s-io#7128
liyimeng added a commit to liyimeng/k3s that referenced this issue Mar 28, 2023
Before starting up new k3s instance, make sure old ones are gone.
To fix issue  k3s-io#7128

Signed-off-by: Liyi Meng <meng.mobile@gmail.com>
@caroline-suse-rancher caroline-suse-rancher added the kind/bug Something isn't working label Apr 18, 2023
@caroline-suse-rancher caroline-suse-rancher added this to the v1.28.3+k3s1 milestone Oct 11, 2023
@caroline-suse-rancher caroline-suse-rancher removed this from the v1.28.3+k3s1 milestone Nov 14, 2023
@caroline-suse-rancher
Contributor

@liyimeng is this still an issue for you? I see the open PR, but it's been some time without an update. Thanks!

@liyimeng
Contributor Author

liyimeng commented Jan 6, 2024

@caroline-suse-rancher Thanks for your attention! I have been using the solution in the PR to solve this problem. So far so good. Not sure if it can help others.

@stale stale bot removed the status/stale label Jan 6, 2024