Single-node k3s etcd failed to reconcile #9701

ChipWolf · 2024-03-08T15:12:01Z

Environmental Info:
K3s Version:

k3s version v1.26.3+k3s1 (01ea3ff2)
go version go1.19.7

Node(s) CPU architecture, OS, and Version:

Linux pi0 5.15.0-1047-raspi #50-Ubuntu SMP PREEMPT Fri Feb 9 13:48:00 UTC 2024 aarch64 aarch64 aarch64 GNU/Linux

Cluster Configuration:

1 server

Describe the bug:

Fatal error on server start

FATA[0004] Failed to reconcile with temporary etcd: wal: max entry size limit exceeded, recBytes: 865, fileSize(19681280) - offset(19681192) - padBytes(7) = entryLimit(81)
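
The numbers in that message can be checked by hand. The sketch below only parses the text of the error quoted above and redoes the subtraction. It does not read any WAL file, and the function name `parse_wal_error` is made up for illustration. It shows that the record header asks for 865 bytes while only 81 bytes remain in the file after the current offset and padding. That is consistent with a truncated or damaged tail of the WAL segment, though it does not prove what caused it.

```python
import re

# Error text copied from the report above.
MESSAGE = (
    "wal: max entry size limit exceeded, recBytes: 865, "
    "fileSize(19681280) - offset(19681192) - padBytes(7) = entryLimit(81)"
)


def parse_wal_error(message: str) -> dict:
    """Extract the integer fields from the WAL entry-size error text."""
    pattern = (
        r"recBytes: (?P<rec_bytes>\d+), "
        r"fileSize\((?P<file_size>\d+)\) - offset\((?P<offset>\d+)\) "
        r"- padBytes\((?P<pad_bytes>\d+)\) = entryLimit\((?P<entry_limit>\d+)\)"
    )
    match = re.search(pattern, message)
    if match is None:
        raise ValueError("message does not look like a WAL entry-size error")
    return {name: int(value) for name, value in match.groupdict().items()}


fields = parse_wal_error(MESSAGE)

# Bytes left in the file after the current read position and padding.
remaining = fields["file_size"] - fields["offset"] - fields["pad_bytes"]

# How many bytes the record claims beyond what the file can still hold.
shortfall = fields["rec_bytes"] - remaining

print(f"remaining bytes: {remaining}, record wants: {fields['rec_bytes']}, "
      f"short by: {shortfall}")
```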

Steps To Reproduce:

Unclear; however, I have a backup of the db directory if anyone wishes to investigate.
The cluster was in routine use when the k3s server crashed.
No changes were made prior to the service crashing.
Subsequent attempts to start the service failed with the above error.

Additional context / logs:

Snapshot recoveries were failing with the same error. I eventually managed to recover the cluster by first moving /var/lib/rancher/k3s/server/db out of the way and then restoring an etcd snapshot.


brandond · 2024-03-08T16:20:22Z

It sounds like your etcd datastore was corrupted. Did the k3s process crash, or did the OS experience a kernel panic and reboot? I've only ever seen anything like this when the node is unexpectedly restarted or powered off, and files are corrupted on disk. I will also note that Raspberry Pis are highly likely to experience filesystem corruption when using SD cards; we do not recommend using etcd with SD cards under any circumstances.

Removing the files from disk and restoring from a snapshot is the proper way to address datastore corruption. Since the errors are from etcd, and were triggered by a crash, I'm not sure we have anything to fix here in K3s.
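
As a rough illustration of that recovery order, the sketch below only builds the list of commands and does not run anything. The `--cluster-reset` and `--cluster-reset-restore-path` flags are K3s server options. Everything else is an assumption: the function name `plan_recovery`, the `.corrupt` suffix, the `k3s` systemd unit name, and the snapshot path in the usage line are all hypothetical. The check on the snapshot location is there because snapshots stored under the db directory would be moved away together with it.

```python
from pathlib import PurePosixPath


def plan_recovery(
    snapshot: str,
    db_dir: str = "/var/lib/rancher/k3s/server/db",
    suffix: str = ".corrupt",  # hypothetical name for the moved-aside copy
) -> list:
    """Return the commands for a move-aside-then-restore recovery (dry run)."""
    snap = PurePosixPath(snapshot)
    db = PurePosixPath(db_dir)

    # A snapshot inside db_dir would disappear when db_dir is moved.
    if db in snap.parents:
        raise ValueError(
            f"{snap} is inside {db}; copy the snapshot elsewhere first"
        )

    return [
        # Assumes k3s runs as a systemd unit named "k3s".
        ["systemctl", "stop", "k3s"],
        # Keep the damaged datastore for later inspection instead of deleting it.
        ["mv", str(db), str(db) + suffix],
        # Reset the cluster and restore from the chosen snapshot.
        ["k3s", "server", "--cluster-reset",
         f"--cluster-reset-restore-path={snap}"],
        # Start the service again once the restore has finished.
        ["systemctl", "start", "k3s"],
    ]


# Hypothetical snapshot location outside the db directory.
plan = plan_recovery("/srv/backup/etcd-snapshot-example")
for argv in plan:
    print(" ".join(argv))
```

Printing the plan instead of executing it keeps the sketch safe to run anywhere; each line should be reviewed against the actual node before being run by hand.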

ChipWolf · 2024-03-08T17:07:08Z

@brandond It's an external USB/SATA SSD, there was no corruption, and the system didn't reboot or panic, hence the issue.

brandond · 2024-03-08T17:14:25Z

Can you provide the complete logs from journald covering the time period before and after the crash? We don't have any way to handle etcd database file corruption in k3s, but if there is a preventable crash occurring we can take a look at that.

brandond closed this as completed Mar 8, 2024
