--cluster-reset renders entire cluster hosed #5070
Comments
This is a bug in etcd 3.5.0; the most recent round of k3s releases updated to etcd 3.5.1, which includes a fix for this issue. Please upgrade and try again. Duplicate of #3997.
I am curious why just rebooting your nodes causes the cluster to break, though. Are your servers getting different addresses from DHCP when they come back up?
No, they are all static IPs.
Me too...
Upgrade first and see what happens. If you still run into issues with nodes not rejoining after a reboot, post the logs from that failure.
The originally rebooted node still fails to connect, and its log basically just repeats the same messages.
Just went through the whole thing again; here's where things get interesting:
From one of the other nodes, I think this is the main issue?
It seems that etcd may need longer to transfer bigger snapshots, which would explain why ~3 weeks in my environment is "long enough."
@brandond this smells suspiciously like a timeout somewhere. Is there any way that you know of to configure it?
Aha, there are no snapshots in the snapshot directory for the leader 🤔 so it has nothing to transfer. The original node that (re)bootstrapped the cluster, however, does have snapshots...
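(For reference: one quick way to confirm whether a member has any raft snapshots on disk is to list the .snap files under its etcd data directory. A minimal Go sketch follows; the data-dir path is an assumption based on the default k3s embedded etcd layout and may differ on your setup.)

```go
package main

import (
	"fmt"
	"log"
	"os"
	"strings"
)

func main() {
	// Assumed default raft snapshot directory for k3s' embedded etcd;
	// adjust if k3s was started with a non-default --data-dir.
	snapDir := "/var/lib/rancher/k3s/server/db/etcd/member/snap"

	entries, err := os.ReadDir(snapDir)
	if err != nil {
		log.Fatalf("cannot read %s: %v", snapDir, err)
	}

	found := 0
	for _, e := range entries {
		if !strings.HasSuffix(e.Name(), ".snap") {
			continue // only raft snapshot files are of interest here
		}
		info, err := e.Info()
		if err != nil {
			continue
		}
		fmt.Printf("%s\t%d bytes\t%s\n", e.Name(), info.Size(), info.ModTime())
		found++
	}
	if found == 0 {
		fmt.Println("no raft snapshots on disk for this member")
	}
}
```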
This is all internal etcd stuff that we don't manage too closely. I believe etcd takes an internal snapshot to bring the newly joined member up to speed... that said, I've never seen an I/O timeout while the member is streaming the initial snapshot from the peer:

Feb 03 21:58:41 capital k3s[6596]: {"level":"info","ts":"2022-02-03T21:58:41.982+0100","caller":"rafthttp/stream.go:274","msg":"established TCP streaming connection with remote peer","stream-writer-type":"stream MsgApp v2","local-member-id":"64f8579255e91227","remote-peer-id":"689da5af1f9c0825"}
Feb 03 21:58:41 capital k3s[6596]: {"level":"info","ts":"2022-02-03T21:58:41.986+0100","caller":"etcdserver/server.go:767","msg":"initialized peer connections; fast-forwarding election ticks","local-member-id":"64f8579255e91227","forward-ticks":8,"forward-duration":"4s","election-ticks":10,"election-timeout":"5s","active-remote-members":2}
Feb 03 21:58:46 capital k3s[6596]: {"level":"warn","ts":"2022-02-03T21:58:46.964+0100","caller":"rafthttp/http.go:271","msg":"failed to save incoming database snapshot","local-member-id":"64f8579255e91227","remote-snapshot-sender-id":"689da5af1f9c0825","incoming-snapshot-index":11225279,"error":"read tcp 192.168.100.3:2380->192.168.100.2:25862: i/o timeout"}

Can you attach logs from all three nodes, covering the time period where you're restarting the node that fails to come back up? Is there anything unique about your configuration? Odd network topology, slow disks, unusual k3s CLI flags?
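(As an aside, one way to compare what each member reports about its own state, such as DB size, raft index, and current leader, is with a small etcd client program. The sketch below is only illustrative: the endpoint and certificate paths are assumptions about a typical k3s embedded etcd setup and should be checked against the actual installation.)

```go
package main

import (
	"context"
	"crypto/tls"
	"crypto/x509"
	"fmt"
	"log"
	"os"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	// Assumed locations of the embedded etcd client certs in a k3s install;
	// verify these paths before relying on them.
	const (
		caFile   = "/var/lib/rancher/k3s/server/tls/etcd/server-ca.crt"
		certFile = "/var/lib/rancher/k3s/server/tls/etcd/client.crt"
		keyFile  = "/var/lib/rancher/k3s/server/tls/etcd/client.key"
		endpoint = "https://127.0.0.1:2379" // assumed local client endpoint
	)

	cert, err := tls.LoadX509KeyPair(certFile, keyFile)
	if err != nil {
		log.Fatal(err)
	}
	caPEM, err := os.ReadFile(caFile)
	if err != nil {
		log.Fatal(err)
	}
	pool := x509.NewCertPool()
	pool.AppendCertsFromPEM(caPEM)

	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{endpoint},
		DialTimeout: 5 * time.Second,
		TLS:         &tls.Config{Certificates: []tls.Certificate{cert}, RootCAs: pool},
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	// Member list as this node sees it.
	members, err := cli.MemberList(ctx)
	if err != nil {
		log.Fatal(err)
	}
	for _, m := range members.Members {
		fmt.Printf("member %x name=%s peerURLs=%v\n", m.ID, m.Name, m.PeerURLs)
	}

	// Local endpoint status: version, DB size, raft index, and leader ID.
	st, err := cli.Status(ctx, endpoint)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("version=%s dbSize=%d raftIndex=%d leader=%x\n",
		st.Version, st.DbSize, st.RaftIndex, st.Leader)
}
```

Running something like this on each server and comparing raftIndex and dbSize can show whether one member is far behind the others, which is what a failed snapshot transfer would leave behind.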
I'm able to reproduce it roughly 1 in 3 times, so I can most likely get the first and second node in the cluster, but so far I haven't been lucky enough to get all three.
I'll get you some logs tomorrow. With all the failing services trying to anti-affinity themselves, it's pretty noisy, so I'll see if I can get them scaled to 0 for some nice clean logs.
The only thing I can think of is that it's bare-metal with a VLAN between the nodes. Here are the k3s flags:
Both logs can be found in this gist. "master" node was started with:
Second node was started with:
Looking at the logs, it seems like there is (maybe) an etcd bug?
and then on the receiving end:
Those sizes are vastly different...
That does appear to be a bug in etcd; one node reports expecting a much smaller snapshot than the other node is prepared to send. Are these nodes both the same architecture? I can't imagine that would be it, but there must be something unique about your environment since I haven't run into this anywhere else and can't reproduce it on my end.
Yeah, they're the same architecture; in fact they're almost exactly the same, with the only difference being the brand of network card. I've tried doing a manual restore using etcdctl and the disaster recovery instructions; it ended up taking a slightly different path to the same result. Anyway, I did discover this in the logs when the server comes online:
So maybe there is some v2 data in the snapshot which is causing an issue, and maybe that's also why you can't replicate it? It doesn't explain why I see this every three weeks... but maybe every few weeks I just run into a different problem :)
Hmm, there could also be some interesting bugs in how etcd calculates sizes, since it up-casts to uint64, does some operations, and (I don't know Go too well) then casts back to an int, which I assume could result in an overflow or some other shenanigans under specific conditions. Granted, the limit is hard-coded at 512 MB, which should never get near the edge of an int... Anyway, I don't know what to do except convert to a single-node cluster with two workers, or rebuild the cluster from scratch.
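(To make the conversion concern concrete, here is a tiny illustration, not etcd's actual code: in Go, converting a uint64 back to int neither panics nor saturates, it just reinterprets the bits, so on a 64-bit platform a value above math.MaxInt64 comes back negative.)

```go
package main

import (
	"fmt"
	"math"
)

func main() {
	// uint64 -> int conversion in Go reinterprets the value rather than erroring.
	// On a 64-bit platform, anything above math.MaxInt64 wraps to a negative int.
	big := uint64(math.MaxInt64) + 1

	fmt.Printf("as uint64: %d\n", big)      // 9223372036854775808
	fmt.Printf("as int:    %d\n", int(big)) // -9223372036854775808

	// For scale: the hard-coded snapshot limit mentioned above is 512 MB,
	// nowhere near that boundary.
	const limit = 512 * 1024 * 1024
	fmt.Printf("512 MB limit: %d bytes\n", limit)
}
```

So, as noted above, a plain conversion overflow is unlikely at these sizes; the mismatch in reported snapshot sizes presumably has some other cause.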
Wireshark to the rescue. I noticed a number of retransmissions and other shenanigans on the link, so I switched the cluster to use the external IP address and everything started working correctly. I don't think this is directly a k3s or etcd bug, but maybe it could be more resilient. Thanks for helping me @brandond!
Environmental Info:
K3s Version:
k3s version v1.22.5+k3s1 (405bf79)
go version go1.16.10
Node(s) CPU architecture, OS, and Version:
Cluster Configuration:
3 servers
Describe the bug:
If a cluster is up for more than approximately 3 weeks, rebooting a node will cause that node to be unable to rejoin the cluster. The last few times this happened, I just rebuilt from scratch. This time, I decided to try and use --cluster-reset to salvage the cluster. However, after it runs, it tells me to start as normal, so I run systemctl start k3s and it fails with:
Steps To Reproduce:
systemctl stop k3s on the now drained node
k3s --cluster-reset on the remaining node
Expected behavior:
A cluster should not fail just because it has been running for a few weeks...
Actual behavior:
Any 3-node cluster will fall apart if a node is rebooted after the cluster has been running for more than a few weeks.
Additional context / logs:
I'm aware of etcd quorum, and as mentioned, this is not an issue if the cluster has only been running for a "short" period of time; the cluster handles it all gracefully. The node being rebooted is not the leader and no voting occurs during the reboot, so the cluster should continue running just fine. I don't believe quorum is the cause of this issue.
Backporting