Single-node k3s etcd failed to reconcile #9701

Closed
ChipWolf opened this issue Mar 8, 2024 · 3 comments

Comments

ChipWolf commented Mar 8, 2024

Environmental Info:
K3s Version:

k3s version v1.26.3+k3s1 (01ea3ff2)
go version go1.19.7

Node(s) CPU architecture, OS, and Version:

Linux pi0 5.15.0-1047-raspi #50-Ubuntu SMP PREEMPT Fri Feb 9 13:48:00 UTC 2024 aarch64 aarch64 aarch64 GNU/Linux

Cluster Configuration:

1 server

Describe the bug:

Fatal error on server start

FATA[0004] Failed to reconcile with temporary etcd: wal: max entry size limit exceeded, recBytes: 865, fileSize(19681280) - offset(19681192) - padBytes(7) = entryLimit(81)
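The numbers in that message can be checked by hand. A minimal sketch, assuming the fields mean what their names suggest (bytes left in the WAL file after the current offset and padding, compared with the size the record claims); the regular expression and helper names are illustrative only and are not part of etcd or K3s:

```python
import re

# Error text copied from the log line above.
ERROR = (
    "wal: max entry size limit exceeded, recBytes: 865, "
    "fileSize(19681280) - offset(19681192) - padBytes(7) = entryLimit(81)"
)

# Hypothetical pattern for pulling the five numbers out of the message.
PATTERN = re.compile(
    r"recBytes: (?P<rec>\d+), "
    r"fileSize\((?P<size>\d+)\) - offset\((?P<offset>\d+)\) - "
    r"padBytes\((?P<pad>\d+)\) = entryLimit\((?P<limit>\d+)\)"
)


def parse_wal_error(message):
    """Return the numeric fields of the error as a dict of ints."""
    match = PATTERN.search(message)
    if match is None:
        raise ValueError("message does not look like the WAL size error")
    return {name: int(value) for name, value in match.groupdict().items()}


def remaining_bytes(fields):
    """Bytes left in the file for the next record: size - offset - padding."""
    return fields["size"] - fields["offset"] - fields["pad"]


fields = parse_wal_error(ERROR)
room = remaining_bytes(fields)
overrun = fields["rec"] - room

print(f"room left in file: {room} bytes (log reports {fields['limit']})")
print(f"record claims:     {fields['rec']} bytes")
print(f"overrun:           {overrun} bytes")
```

Read this way, the record at that offset claims 865 bytes while only 81 bytes remain in the file, so the record cannot fit in what is on disk.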

Steps To Reproduce:

  • Unclear; however, I have a backup of the db directory if anyone wishes to investigate.
  • The cluster was in routine use when the k3s server crashed.
  • No changes were made prior to the service crashing.
  • Subsequent attempts to start the service failed with the above error.

Additional context / logs:

Snapshot recoveries were failing with the same error. I eventually managed to recover the cluster by first moving /var/lib/rancher/k3s/server/db and then restoring an etcd snapshot.
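For anyone landing here with the same error, the recovery described above can be sketched roughly as follows. This only prints a plan and runs nothing. It assumes a systemd unit named `k3s` and a restore through `k3s server --cluster-reset --cluster-reset-restore-path=...`; the `.corrupt` suffix, the `recovery_plan` helper, and the snapshot path are placeholders, not values from this report.

```python
import shlex
from pathlib import PurePosixPath

# Datastore directory named in the report.
DB_DIR = PurePosixPath("/var/lib/rancher/k3s/server/db")


def recovery_plan(snapshot_path, suffix=".corrupt"):
    """Build the command list for moving the db directory aside and restoring.

    snapshot_path must point at a snapshot file that still exists after the
    move. If the snapshot lived inside the db directory, it now sits under
    the renamed directory instead.
    """
    moved = DB_DIR.with_name(DB_DIR.name + suffix)
    return [
        ["systemctl", "stop", "k3s"],     # assumes the unit is called k3s
        ["mv", str(DB_DIR), str(moved)],  # keep the old files for analysis
        ["k3s", "server", "--cluster-reset",
         f"--cluster-reset-restore-path={snapshot_path}"],
        ["systemctl", "start", "k3s"],    # restart without --cluster-reset
    ]


# Placeholder path: substitute the real snapshot location.
plan = recovery_plan("/path/to/snapshot")
for step in plan:
    print(shlex.join(step))
```

Renaming the directory rather than deleting it keeps the damaged files available for later inspection.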

brandond (Member) commented Mar 8, 2024

It sounds like your etcd datastore was corrupted. Did the k3s process crash, or did the OS experience a kernel panic and reboot? I've only ever seen anything like this when the node is unexpectedly restarted or powered off, and files are corrupted on disk. I will also note that Raspberry Pis are highly likely to experience filesystem corruption when using SD cards; we do not recommend using etcd with SD cards under any circumstances.

Removing the files from disk and restoring from a snapshot is the proper way to address datastore corruption. Since the errors are from etcd, and were triggered by a crash, I'm not sure we have anything to fix here in K3s.

brandond closed this as completed Mar 8, 2024

ChipWolf (Author) commented Mar 8, 2024

> It sounds like your etcd datastore was corrupted. Did the k3s process crash, or did the OS experience a kernel panic and reboot? I've only ever seen anything like this when the node is unexpectedly restarted or powered off, and files are corrupted on disk. I will also note that Raspberry Pis are highly likely to experience filesystem corruption when using SD cards; we do not recommend using etcd with SD cards under any circumstances.
>
> Removing the files from disk and restoring from a snapshot is the proper way to address datastore corruption. Since the errors are from etcd, and were triggered by a crash, I'm not sure we have anything to fix here in K3s.

@brandond It's an external USB/SATA SSD, with no corruption, and the system didn't reboot or panic, hence the issue.

brandond (Member) commented Mar 8, 2024

Can you provide the complete logs from journald covering the time period before and after the crash? We don't have any way to handle etcd database file corruption in k3s, but if there is a preventable crash occurring we can take a look at that.
