k3s etcd panics after +10 hours on idle pi4 8GB cluster #2938
Extra crash log of mnode1. I have 3 Raspberry Pis, each running as a single master node with the same setup and not connected to each other.
mnode1 (crash log attached)
Extra info: this is mnode3, with the exact same hardware and software setup, but it is still running.
mnode3 (top output attached)
Can you attach (rather than paste inline) complete logs from all three nodes? The ultimate cause of the crash is the usual - excessive etcd latency causing a failure to renew a critical controller lease:
You'll need to correlate other system activity to figure out why etcd wasn't responding properly. The usual cause of this is IO contention (something else doing a burst of heavy IO) on one of the etcd servers that prevents it from acknowledging writes in a timely manner. Are you using SD cards, or external USB-attached storage? Etcd pushes an EXTREMELY high write volume that will quickly burn out SD cards, especially smaller ones.
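One commonly used way to gauge whether a disk can keep up with etcd's fsync-heavy write pattern is a short fio run with fdatasync after every write. This is only a minimal sketch; the target directory, file size, and the 2300-byte block size (roughly the size of a typical etcd write) are illustrative values, not something from this issue:

```sh
# Rough fsync-latency check for an etcd-style write pattern (illustrative values).
# The "sync" latency percentiles in fio's output are the interesting part;
# high tail latencies there line up with the lease-renewal failures above.
mkdir -p /var/lib/etcd-fio-test
fio --name=etcd-fsync-test \
    --directory=/var/lib/etcd-fio-test \
    --rw=write --ioengine=sync --fdatasync=1 \
    --size=22m --bs=2300
rm -rf /var/lib/etcd-fio-test
```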
@brandond thanks for your quick reply. FYI, I do not do anything on these Pi4s besides running an idle k3s cluster. More info:
Note: I only have the first couple of hours and the last couple of hours of logs because of my journal vacuuming settings.
mnode1 (crashed and zombie)
mnode2 (crashed and zombie)
mnode3 (crashed Feb 16 21:13:58 CET, but recovered)
Extra: these are bonnie++ test results for mnode1, mnode2, and mnode3 (attached).
Extra: this is iostat output from mnode3, which is still running, to give you an idea of the disk "load".
Can you show iostat from the period in which the nodes are crashing? These latencies are WAY higher than I would expect to see from SSDs. Have you tried this same configuration with either a single SSD in the pool, and/or a single SSD using a different filesystem (xfs/ext4)? Also, if you are planning on running standalone nodes and don't need any HA, you might try just using the default sqlite instead of etcd, as it is MUCH less demanding in terms of IO throughput and latency.
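For context on that last suggestion: k3s only runs embedded etcd when the server is started with --cluster-init (or joined to an existing etcd cluster); without that flag a single server uses the embedded sqlite datastore. A minimal sketch of the two install variants, assuming the standard get.k3s.io install script is used (this issue's actual install script is attached above, not reproduced here):

```sh
# Embedded etcd (presumably what these nodes run, given the etcd panics):
# --cluster-init switches the datastore to embedded etcd.
curl -sfL https://get.k3s.io | INSTALL_K3S_EXEC="server --cluster-init" sh -

# Default single-node datastore (sqlite) -- far lighter on disk IO;
# simply omit --cluster-init.
curl -sfL https://get.k3s.io | sh -
```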
I did not run iostat while the nodes were running, but I can start new tests on all nodes and run iostat alongside k3s. Would these be good settings for iostat? And would you like any other statistics collected?
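For reference, one reasonable way to capture this is extended per-device statistics with timestamps at a fixed interval, running alongside k3s. The interval and log path below are purely illustrative:

```sh
# iostat comes from the sysstat package.
# -x extended device stats, -m report in MB, -t print a timestamp per sample;
# timestamps make it easy to line samples up with the k3s journal.
iostat -xmt 10 >> /var/log/iostat-k3s.log &
```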
FYI, I am using a Samsung FIT 128GB USB 3.1 flash drive, not an SSD. I can also build other test setups besides the nodes above.
My plan is to eventually run a cluster of 4 Pi4 8GB + 4 Pi4 4GB nodes. But first I need to see whether a single RPI4 8GB node with no load on it is stable and does not crash. What is interesting is that mnode1 and mnode2 crash and the k3s process remains a zombie, while mnode3 eventually also crashed with the same error (leaderelection lost) but automatically restarted k3s successfully. To be clear, all 3 Pi4 nodes have exactly the same hardware specs and software setup.
Hardware specs (attached)
Software specs (attached)
Ah OK, so it's just a generic USB flash drive? I have a Pi4 that is booting off a 128GB NVMe drive in a USB3 NVMe enclosure and it works great. "Thumb drive" type devices may be a bit more limited in terms of performance.
Small update: when I run k3s on the same hardware and OS with the same settings, but use ext4 instead of ZFS, it runs stable. My guess is that (async) ZFS in combination with the USB thumb drive is causing the k3s crash with etcd.
Environmental Info:
K3s Version:
v1.20.2+k3s1 (1d4adb0) go version go1.15.5
Node(s) CPU architecture, OS, and Version:
Arm64 (raspberry pi 4 8GB), Ubuntu 20.10 with zfs
uname -a
Linux mnode1 5.8.0-1015-raspi #18-Ubuntu SMP PREEMPT Fri Feb 5 06:09:58 UTC 2021 aarch64 aarch64 aarch64 GNU/Linux
containerd version
containerd github.com/containerd/containerd 1.3.7-0ubuntu3.2
calico version
calico-3.17.2
zfs version
zfs-0.8.4-1ubuntu11.1
zfs-kmod-0.8.4-1ubuntu11.1
Cluster Configuration:
1 master
Describe the bug:
After installing a basic single-node k3s cluster with Calico and containerd, k3s panics after several hours and turns into a zombie process. Only a restart can kill k3s. The cluster is not doing anything besides running the Kubernetes dashboard. I have 3 identical Pi4 8GB nodes with the same hardware and software configuration, and 2 out of 3 have already crashed: one after 16 hours and the other after around 25 hours.
Steps To Reproduce:
Install script used on a Raspberry Pi running Ubuntu 20.10 (script attached).
Before running the script I had containerd installed with the following settings (a rough sketch of this setup follows the list):
- install containerd
- add containerd config at /etc/containerd/config.toml with the following (config attached)
- create a containerd dataset
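The actual install script and config.toml contents are attached to the issue rather than shown above. As a rough sketch of what a containerd-on-ZFS setup like the one described usually involves (the pool name "rpool" and dataset name are assumptions, not taken from this issue):

```sh
# Assumption: a ZFS pool named "rpool" exists. containerd's zfs snapshotter
# expects a dataset mounted at its snapshotter directory.
zfs create -o mountpoint=/var/lib/containerd/io.containerd.snapshotter.v1.zfs \
    rpool/containerd

# Then point containerd's CRI plugin at the zfs snapshotter in
# /etc/containerd/config.toml (snapshotter = "zfs") and restart containerd.
systemctl restart containerd
```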
Expected behavior:
The cluster just keeps running stably and does not crash.
Actual behavior:
k3s panics after 10+ hours and leaves the k3s process as a zombie.
Additional context / logs:
bonnie++ results (attached)
journalctl -b --no-pager -u k3s output for mnode2 (attached)