k3s server crashing on Raspberry Pi 4 (8GB) #6654

Closed
samip5 opened this issue Dec 15, 2022 · 14 comments

samip5 commented Dec 15, 2022

Environmental Info:
K3s Version: v1.24.4+k3s1 (c3f830e)
go version go1.18.1

Node(s) CPU architecture, OS, and Version:

arm64, Ubuntu 22.04 (all except two)
Linux k8s-master1 5.15.0-1021-raspi #23-Ubuntu SMP PREEMPT Fri Nov 25 15:27:43 UTC 2022 aarch64 aarch64 aarch64 GNU/Linux

amd64, Ubuntu 22.04 x 2
Linux k8s-worker-amd64-0 5.15.0-56-generic #62-Ubuntu SMP Tue Nov 22 19:54:14 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

Cluster Configuration:
1 server, 5 agents

Describe the bug:
My k3s apiserver seems to frequently crash / auto-restart

Steps To Reproduce:

Expected behavior:
I would expect it to not keep frequently crashing.

Actual behavior:
Frequent crashes / auto-restarts of service

Additional context / logs:
k3s.log

samip5 commented Dec 15, 2022

It happened again; this time I got:

Dec 15 11:11:26 k8s-master1 k3s[2140000]: E1215 11:11:26.004914 2140000 server.go:218] "Leaderelection lost"

I don't understand why or how this happens.

Updated logs:
k3s_20221215T1113.log

bbkz commented Dec 15, 2022

I had similar problems when using etcd on SD cards (industrial ones); they can't really handle etcd, as it is write-intensive.

After switching to eMMC, etcd was happy.
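A quick way to check whether a given disk can keep up with etcd's write pattern is the fio fdatasync benchmark from etcd's hardware guidance - a sketch, assuming fio is installed and the k3s datastore lives under the default /var/lib/rancher/k3s:

```sh
# Benchmark fdatasync latency on the disk that holds the k3s datastore.
# etcd's rule of thumb is a 99th-percentile fdatasync under roughly 10 ms;
# the percentiles are printed in fio's "fsync/fdatasync" latency section.
sudo mkdir -p /var/lib/rancher/k3s/fio-test
sudo fio --rw=write --ioengine=sync --fdatasync=1 \
    --directory=/var/lib/rancher/k3s/fio-test \
    --size=22m --bs=2300 --name=etcd-fsync-test
sudo rm -rf /var/lib/rancher/k3s/fio-test   # remove the test data afterwards
```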

samip5 commented Dec 15, 2022

> I had similar problems when using etcd on SD cards (industrial ones); they can't really handle etcd, as it is write-intensive.
>
> After switching to eMMC, etcd was happy.

It's not running on an SD card; it's running off an external SSD.

bbkz commented Dec 15, 2022

OK. What I also had to do to make it stable was cordon the master nodes.
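For reference, cordoning just marks the node unschedulable; pods already running there stay put unless the node is also drained. A sketch, using the server hostname from this issue:

```sh
# Stop new pods from being scheduled onto the server node
kubectl cordon k8s-master1

# Optionally move existing (non-DaemonSet) workloads off it as well
kubectl drain k8s-master1 --ignore-daemonsets --delete-emptydir-data
```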

samip5 commented Dec 15, 2022

> OK. What I also had to do to make it stable was cordon the master nodes.

That seems weird... I also only have one master. :)

bbkz commented Dec 15, 2022

I'm running it stably on 4 Fedora RPi 4s and 4 Odroid N2+ boards with 3 master nodes.

But I just found the following, and will give Raspberry Pi OS another try:

Unfortunately I don't have another idea.

brandond (Contributor) commented

If you have only a single server, there's not really any point in using etcd - especially on a Raspberry Pi, where CPU and IO are already somewhat constrained. You can't go back to sqlite from etcd, but you might consider rebuilding the cluster at some point, and not using etcd. The logs show that your storage (even if it is SSD) is not able to keep up, and it is frequently taking several seconds for etcd to sync your changes to disk - to the point where leader elections are timing out. This is almost exclusively caused by high storage fsync latency.
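The slow syncs are visible in the journal as etcd warnings; something along these lines should surface them, assuming a systemd-managed k3s install:

```sh
# etcd warns when fdatasync or applying a request takes unusually long;
# these warnings tend to cluster right before "Leaderelection lost".
journalctl -u k3s --since "2 hours ago" \
  | grep -iE "slow fdatasync|apply request took too long|leaderelection lost"
```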

If you're on a node with older iptables, you might take a look at the --prefer-bundled-bin flag available in the releases coming out this month - but that will only fix the issue with growing iptables rulesets, it will not do anything about disk latency.
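For a rebuilt single-server cluster, that roughly translates to: don't pass --cluster-init (so k3s keeps its default SQLite datastore), and opt into the bundled iptables once on a release that ships the flag. A hedged sketch of the config:

```sh
# Single-server k3s without embedded etcd: simply omit --cluster-init.
# prefer-bundled-bin makes k3s use its own iptables binaries instead of the
# host's (only relevant on releases that actually include the flag).
sudo mkdir -p /etc/rancher/k3s
cat <<'EOF' | sudo tee /etc/rancher/k3s/config.yaml
prefer-bundled-bin: true
EOF
curl -sfL https://get.k3s.io | sh -
```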

samip5 commented Dec 16, 2022

> If you have only a single server, there's not really any point in using etcd - especially on a Raspberry Pi, where CPU and IO are already somewhat constrained. You can't go back to sqlite from etcd, but you might consider rebuilding the cluster at some point, and not using etcd. The logs show that your storage (even if it is SSD) is not able to keep up, and it is frequently taking several seconds for etcd to sync your changes to disk - to the point where leader elections are timing out. This is almost exclusively caused by high storage fsync latency.
>
> If you're on a node with older iptables, you might take a look at the --prefer-bundled-bin flag available in the releases coming out this month - but that will only fix the issue with growing iptables rulesets, it will not do anything about disk latency.

The crashing was already happening when running SQLite (not sure why, though), and as SQLite doesn't keep db backups, I thought etcd would be better for that reason.

It seems that I should move the master off a Pi, or rather have it running on a CM4 with eMMC storage?
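On the backup point: with embedded etcd, k3s takes scheduled snapshots on its own and can take one on demand; with SQLite the datastore is a plain file that can be copied while k3s is stopped. A rough sketch of both, assuming default paths:

```sh
# Embedded etcd: on-demand snapshot (scheduled snapshots land in
# /var/lib/rancher/k3s/server/db/snapshots by default)
sudo k3s etcd-snapshot save

# SQLite (kine): copy the db directory with k3s stopped
sudo systemctl stop k3s
sudo cp -a /var/lib/rancher/k3s/server/db /backup/k3s-db-$(date +%F)
sudo systemctl start k3s
```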

brandond commented Dec 16, 2022

I have personally run K3s on a Pi4b with SSD using etcd with no issues. I have also used sqlite on SDHC without issues. However, in both cases I made sure that IO-intensive workloads were not using the same disk as the datastore - I put everything on NFS PVCs and minimized large image pull operations. The key is just to make sure that there's not a lot of other IO that needs to be flushed before the datastore write can complete.
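A minimal sketch of the "everything on NFS PVCs" idea - a statically provisioned NFS PersistentVolume plus a claim bound to it. The server address and export path below are placeholders, and the nodes need an NFS client (nfs-common) installed:

```sh
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: PersistentVolume
metadata:
  name: nfs-example
spec:
  capacity:
    storage: 10Gi
  accessModes: ["ReadWriteMany"]
  persistentVolumeReclaimPolicy: Retain
  nfs:
    server: 192.168.1.10   # placeholder NAS address
    path: /exports/k8s     # placeholder export path
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: nfs-example
spec:
  accessModes: ["ReadWriteMany"]
  storageClassName: ""
  volumeName: nfs-example
  resources:
    requests:
      storage: 10Gi
EOF
```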

samip5 commented Dec 17, 2022

It seems that Longhorn was scheduled on the master, which is probably a bad thing, so I evicted it via a toleration. Let's see if that helps.
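For anyone doing the same: the usual way to keep workloads like Longhorn off a single server node is to taint it, so only pods that explicitly tolerate the taint run there. A sketch, again using the hostname from this issue:

```sh
# NoSchedule keeps new pods off the server; NoExecute would also evict
# pods already running there that don't tolerate the taint.
kubectl taint nodes k8s-master1 node-role.kubernetes.io/control-plane:NoSchedule
```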

brandond commented Dec 17, 2022

Oh yeah, that would do it. If you're going to do LH (Longhorn), try to put it on a separate physical disk from the datastore to avoid competing with it for IOPS.
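A rough sketch of that layout, assuming a second SSD dedicated to Longhorn and a Helm install; the device name and mount point are placeholders, and defaultSettings.defaultDataPath is the chart value for where replica data is stored (default /var/lib/longhorn):

```sh
# Put Longhorn's replica data on its own disk, away from the k3s datastore
sudo mkfs.ext4 /dev/sda1            # placeholder: partition on the second SSD
sudo mkdir -p /mnt/longhorn
echo '/dev/sda1 /mnt/longhorn ext4 defaults 0 2' | sudo tee -a /etc/fstab
sudo mount /mnt/longhorn

# Point Longhorn at the dedicated mount when installing/upgrading via Helm
helm upgrade --install longhorn longhorn/longhorn --namespace longhorn-system \
  --create-namespace --set defaultSettings.defaultDataPath=/mnt/longhorn
```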

samip5 commented Dec 17, 2022

Two USB enclosures with SSDs would probably end up competing for USB bandwidth (at least on a Pi 4). :)

bbkz commented Dec 17, 2022

I got myself some RasPiKeys, which are eMMC storage keys that go in the SD card slot.

caroline-suse-rancher (Contributor) commented

I'm going to convert this to a discussion just in case someone runs into the same thing. It doesn't appear to be a clear K3s bug, though.

@k3s-io k3s-io locked and limited conversation to collaborators Apr 19, 2023
@caroline-suse-rancher caroline-suse-rancher converted this issue into discussion #7317 Apr 19, 2023
