--cluster-reset renders entire cluster hosed #5070

Closed · withinboredom opened this issue Feb 3, 2022 · 16 comments

@withinboredom (Contributor) commented Feb 3, 2022

Environmental Info:
K3s Version:

k3s version v1.22.5+k3s1 (405bf79)
go version go1.16.10

Node(s) CPU architecture, OS, and Version:

Linux delightful 5.13.0-28-generic #31-Ubuntu SMP Thu Jan 13 17:41:06 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
Linux capital 5.13.0-28-generic #31-Ubuntu SMP Thu Jan 13 17:41:06 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
Linux donkey 5.13.0-28-generic #31-Ubuntu SMP Thu Jan 13 17:41:06 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

Cluster Configuration:

3 servers

Describe the bug:

If a cluster is up for more than approximately 3 weeks, rebooting a node will cause that node to be unable to rejoin the cluster. The last few times this happened, I just rebuilt from scratch. This time, I decided to try to use --cluster-reset to salvage the cluster. However, after it runs, it tells me to start k3s as normal, so I run systemctl start k3s and it fails with:

Feb 03 20:57:10 delightful k3s[1591061]: {"level":"panic","ts":"2022-02-03T20:57:10.654+0100","caller":"rafthttp/transport.go:346","msg":"unexpected removal of unknown remote peer","remote-peer-id":"ec5b7981eb5dd5de","stacktrace":"github.com/rancher/k3s/vendor/go.etcd.io/etcd/server/v3/etcdserver/api/rafthttp.(*Transport).removePeer\n\t/go/src/github.com/rancher/k3s/vendor/go.etcd.io/etcd/server/v3/etcdserver/api/rafthttp/transport.go:346\ngithub.com/rancher/k3s/vendor/go.etcd.io/etcd/server/v3/etcdserver/api/rafthttp.(*Transport).RemovePeer\n\t/go/src/github.com/rancher/k3s/vendor/go.etcd.io/etcd/server/v3/etcdserver/api/rafthttp/transport.go:329\ngithub.com/rancher/k3s/vendor/go.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).applyConfChange\n\t/go/src/github.com/rancher/k3s/vendor/go.etcd.io/etcd/server/v3/etcdserver/server.go:2303\ngithub.com/rancher/k3s/vendor/go.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).apply\n\t/go/src/github.com/rancher/k3s/vendor/go.etcd.io/etcd/server/v3/etcdserver/server.go:2135\ngithub.com/rancher/k3s/vendor/go.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).applyEntries\n\t/go/src/github.com/rancher/k3s/vendor/go.etcd.io/etcd/server/v3/etcdserver/server.go:1357\ngithub.com/rancher/k3s/vendor/go.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).applyAll\n\t/go/src/github.com/rancher/k3s/vendor/go.etcd.io/etcd/server/v3/etcdserver/server.go:1179\ngithub.com/rancher/k3s/vendor/go.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).run.func8\n\t/go/src/github.com/rancher/k3s/vendor/go.etcd.io/etcd/server/v3/etcdserver/server.go:1111\ngithub.com/rancher/k3s/vendor/go.etcd.io/etcd/pkg/v3/schedule.(*fifo).run\n\t/go/src/github.com/rancher/k3s/vendor/go.etcd.io/etcd/pkg/v3/schedule/schedule.go:157"}
Feb 03 20:57:10 delightful k3s[1591061]: panic: unexpected removal of unknown remote peer
Feb 03 20:57:10 delightful k3s[1591061]: goroutine 295 [running]:
Feb 03 20:57:10 delightful k3s[1591061]: github.com/rancher/k3s/vendor/go.uber.org/zap/zapcore.(*CheckedEntry).Write(0xc000ff4000, 0xc005980340, 0x1, 0x1)
Feb 03 20:57:10 delightful k3s[1591061]:         /go/src/github.com/rancher/k3s/vendor/go.uber.org/zap/zapcore/entry.go:234 +0x58d
Feb 03 20:57:10 delightful k3s[1591061]: github.com/rancher/k3s/vendor/go.uber.org/zap.(*Logger).Panic(0xc000160910, 0x4fe1c86, 0x29, 0xc005980340, 0x1, 0x1)
Feb 03 20:57:10 delightful k3s[1591061]:         /go/src/github.com/rancher/k3s/vendor/go.uber.org/zap/logger.go:227 +0x85
Feb 03 20:57:10 delightful k3s[1591061]: github.com/rancher/k3s/vendor/go.etcd.io/etcd/server/v3/etcdserver/api/rafthttp.(*Transport).removePeer(0xc000fee000, 0xec5b7981eb5dd5de)
Feb 03 20:57:10 delightful k3s[1591061]:         /go/src/github.com/rancher/k3s/vendor/go.etcd.io/etcd/server/v3/etcdserver/api/rafthttp/transport.go:346 +0x58c
Feb 03 20:57:10 delightful k3s[1591061]: github.com/rancher/k3s/vendor/go.etcd.io/etcd/server/v3/etcdserver/api/rafthttp.(*Transport).RemovePeer(0xc000fee000, 0xec5b7981eb5dd5de)
Feb 03 20:57:10 delightful k3s[1591061]:         /go/src/github.com/rancher/k3s/vendor/go.etcd.io/etcd/server/v3/etcdserver/api/rafthttp/transport.go:329 +0x7d
Feb 03 20:57:10 delightful k3s[1591061]: github.com/rancher/k3s/vendor/go.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).applyConfChange(0xc00182ea00, 0x1, 0xec5b7981eb5dd5de, 0x0, 0x0, 0x0, 0x0, 0xc00158ea80, 0x0, 0x0, ...)
Feb 03 20:57:10 delightful k3s[1591061]:         /go/src/github.com/rancher/k3s/vendor/go.etcd.io/etcd/server/v3/etcdserver/server.go:2303 +0x872
Feb 03 20:57:10 delightful k3s[1591061]: github.com/rancher/k3s/vendor/go.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).apply(0xc00182ea00, 0xc0044691a0, 0x10e, 0x1a2, 0xc00158ea80, 0xc002624118, 0xc0026240d0, 0x24b1a1d)
Feb 03 20:57:10 delightful k3s[1591061]:         /go/src/github.com/rancher/k3s/vendor/go.etcd.io/etcd/server/v3/etcdserver/server.go:2135 +0x59a
Feb 03 20:57:10 delightful k3s[1591061]: github.com/rancher/k3s/vendor/go.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).applyEntries(0xc00182ea00, 0xc00158ea80, 0xc0049bc0b0)
Feb 03 20:57:10 delightful k3s[1591061]:         /go/src/github.com/rancher/k3s/vendor/go.etcd.io/etcd/server/v3/etcdserver/server.go:1357 +0xe5
Feb 03 20:57:10 delightful k3s[1591061]: github.com/rancher/k3s/vendor/go.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).applyAll(0xc00182ea00, 0xc00158ea80, 0xc0049bc0b0)
Feb 03 20:57:10 delightful k3s[1591061]:         /go/src/github.com/rancher/k3s/vendor/go.etcd.io/etcd/server/v3/etcdserver/server.go:1179 +0x88
Feb 03 20:57:10 delightful k3s[1591061]: github.com/rancher/k3s/vendor/go.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).run.func8(0x5959b78, 0xc004c88480)
Feb 03 20:57:10 delightful k3s[1591061]:         /go/src/github.com/rancher/k3s/vendor/go.etcd.io/etcd/server/v3/etcdserver/server.go:1111 +0x3c
Feb 03 20:57:10 delightful k3s[1591061]: github.com/rancher/k3s/vendor/go.etcd.io/etcd/pkg/v3/schedule.(*fifo).run(0xc0003d5020)
Feb 03 20:57:10 delightful k3s[1591061]:         /go/src/github.com/rancher/k3s/vendor/go.etcd.io/etcd/pkg/v3/schedule/schedule.go:157 +0xf3
Feb 03 20:57:10 delightful k3s[1591061]: created by github.com/rancher/k3s/vendor/go.etcd.io/etcd/pkg/v3/schedule.NewFIFOScheduler
Feb 03 20:57:10 delightful k3s[1591061]:         /go/src/github.com/rancher/k3s/vendor/go.etcd.io/etcd/pkg/v3/schedule/schedule.go:70 +0x13b
Feb 03 20:57:10 delightful systemd[1]: k3s.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
Feb 03 20:57:10 delightful systemd[1]: k3s.service: Failed with result 'exit-code'.

Steps To Reproduce:

  • Run a 3-node cluster for more than a few weeks (this is a requirement; it works flawlessly if the cluster is less than a few weeks old)
  • Drain a node
  • Reboot said node
  • See errors on said node about being unable to connect to etcd
  • See errors on the etcd leader about the sending buffer being full
  • Drain another node
  • systemctl stop k3s on the now-drained node
  • Stop k3s on the remaining node
  • Run k3s --cluster-reset on the remaining node
  • Start k3s when prompted and see the error (rough command sketch below)
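
Roughly, the steps above correspond to something like the following (a minimal sketch; node names are placeholders and k3s is assumed to be managed by systemd):

# drain and reboot one of the three servers
kubectl drain <node> --ignore-daemonsets --delete-emptydir-data
ssh <node> sudo reboot

# after it fails to rejoin: stop k3s on all but one remaining server
sudo systemctl stop k3s

# on the single remaining server, reset etcd to a one-member cluster
sudo k3s server --cluster-reset

# then start k3s again as prompted
sudo systemctl start k3s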

Expected behavior:

A cluster should not fail just because it has been running for a few weeks...

Actual behavior:

Any 3-node cluster will fall apart if a node is rebooted after the cluster has been running for more than a few weeks.

Additional context / logs:

I'm aware of etcd quorum and, as mentioned, this is not an issue if the cluster has only been running for a "short" period of time; the cluster handles it all gracefully. The node being rebooted is not the leader and no voting occurs during the reboot, so the cluster should continue running just fine. I don't believe quorum loss is the cause of this issue.

Backporting

  • Needs backporting to older releases
@brandond (Contributor) commented Feb 3, 2022

Feb 03 20:57:10 delightful k3s[1591061]: panic: unexpected removal of unknown remote peer

This is a bug in etcd 3.5.0; the most recent round of k3s releases has updated to etcd 3.5.1, which includes a fix for this issue. Please upgrade your version and try again.

duplicate of #3997
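
For reference, upgrading a script-installed k3s server is usually just a matter of re-running the installer pinned to a newer release (the pinned version below is illustrative; use whichever current release ships etcd 3.5.1 or newer, and pass the same server flags that were used originally):

# illustrative upgrade of a single server node
curl -sfL https://get.k3s.io | INSTALL_K3S_VERSION="v1.22.6+k3s1" sh -s - server <original flags>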

@brandond (Contributor) commented Feb 3, 2022

I am curious why just rebooting your nodes causes the cluster to break though. Are your servers getting different addresses from DHCP when they come back up?

@withinboredom (Contributor, Author)

Are your servers getting different addresses from DHCP when they come back up?

No, they are all static IPs.

I am curious why just rebooting your nodes causes the cluster to break

Me too...

@brandond (Contributor) commented Feb 3, 2022

Upgrade first, see what happens. If you still run into issues with nodes not rejoining after a reboot, post the logs from that failure.

@withinboredom (Contributor, Author)

The originally rebooted node still fails to connect, and the log is basically repeating:

Feb 03 21:46:45 capital k3s[6465]: time="2022-02-03T21:46:45+01:00" level=info msg="Connecting to proxy" url="wss://127.0.0.1:6443/v1-k3s/connect"
Feb 03 21:46:45 capital k3s[6465]: time="2022-02-03T21:46:45+01:00" level=error msg="Failed to connect to proxy" error="websocket: bad handshake"
Feb 03 21:46:45 capital k3s[6465]: time="2022-02-03T21:46:45+01:00" level=error msg="Remotedialer proxy error" error="websocket: bad handshake"
Feb 03 21:46:49 capital k3s[6465]: {"level":"warn","ts":"2022-02-03T21:46:49.159+0100","logger":"etcd-client","caller":"v3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc00122ea80/127.0.0.1:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
Feb 03 21:46:49 capital k3s[6465]: time="2022-02-03T21:46:49+01:00" level=info msg="Failed to test data store connection: context deadline exceeded"
Feb 03 21:46:50 capital k3s[6465]: {"level":"warn","ts":"2022-02-03T21:46:50.237+0100","logger":"etcd-client","caller":"v3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc00122ea80/127.0.0.1:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
Feb 03 21:46:50 capital k3s[6465]: time="2022-02-03T21:46:50+01:00" level=error msg="Failed to check local etcd status for learner management: context deadline exceeded"
Feb 03 21:46:50 capital k3s[6465]: time="2022-02-03T21:46:50+01:00" level=info msg="Connecting to proxy" url="wss://127.0.0.1:6443/v1-k3s/connect"
Feb 03 21:46:50 capital k3s[6465]: time="2022-02-03T21:46:50+01:00" level=error msg="Failed to connect to proxy" error="websocket: bad handshake"
Feb 03 21:46:50 capital k3s[6465]: time="2022-02-03T21:46:50+01:00" level=error msg="Remotedialer proxy error" error="websocket: bad handshake"
Feb 03 21:46:55 capital k3s[6465]: time="2022-02-03T21:46:55+01:00" level=info msg="Connecting to proxy" url="wss://127.0.0.1:6443/v1-k3s/connect"
Feb 03 21:46:55 capital k3s[6465]: time="2022-02-03T21:46:55+01:00" level=error msg="Failed to connect to proxy" error="websocket: bad handshake"
Feb 03 21:46:55 capital k3s[6465]: time="2022-02-03T21:46:55+01:00" level=error msg="Remotedialer proxy error" error="websocket: bad handshake"
Feb 03 21:46:55 capital k3s[6465]: {"level":"warn","ts":"2022-02-03T21:46:55.326+0100","caller":"etcdserver/server.go:2050","msg":"failed to publish local member to cluster through raft","local-member-id":"71dc45d5eb7cb06f","local-member-attributes":"{Name:capital-c6feee58 ClientURLs:[https://192.168.100.3:2379]}","request-path":"/0/members/71dc45d5eb7cb06f/attributes","publish-timeout":"15s","error":"etcdserver: request timed out"}
Feb 03 21:47:00 capital k3s[6465]: time="2022-02-03T21:47:00+01:00" level=info msg="Connecting to proxy" url="wss://127.0.0.1:6443/v1-k3s/connect"
Feb 03 21:47:00 capital k3s[6465]: time="2022-02-03T21:47:00+01:00" level=error msg="Failed to connect to proxy" error="websocket: bad handshake"
Feb 03 21:47:00 capital k3s[6465]: time="2022-02-03T21:47:00+01:00" level=error msg="Remotedialer proxy error" error="websocket: bad handshake"
Feb 03 21:47:02 capital kernel: [UFW BLOCK] IN=enp8s0.4000 OUT= MAC=04:42:1a:1f:02:f0:a8:a1:59:8e:72:b5:08:00 SRC=192.168.100.2 DST=192.168.100.3 LEN=40 TOS=0x00 PREC=0x00 TTL=64 ID=0 DF PROTO=TCP SPT=55636 DPT=2380 WINDOW=0 RES=0x00 RST URGP=0
Feb 03 21:47:04 capital k3s[6465]: {"level":"warn","ts":"2022-02-03T21:47:04.159+0100","logger":"etcd-client","caller":"v3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc00122ea80/127.0.0.1:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
Feb 03 21:47:04 capital k3s[6465]: time="2022-02-03T21:47:04+01:00" level=info msg="Failed to test data store connection: context deadline exceeded"
Feb 03 21:47:05 capital k3s[6465]: {"level":"warn","ts":"2022-02-03T21:47:05.238+0100","logger":"etcd-client","caller":"v3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc00122ea80/127.0.0.1:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
Feb 03 21:47:05 capital k3s[6465]: time="2022-02-03T21:47:05+01:00" level=error msg="Failed to check local etcd status for learner management: context deadline exceeded"
Feb 03 21:47:05 capital k3s[6465]: time="2022-02-03T21:47:05+01:00" level=info msg="Connecting to proxy" url="wss://127.0.0.1:6443/v1-k3s/connect"
Feb 03 21:47:05 capital k3s[6465]: time="2022-02-03T21:47:05+01:00" level=error msg="Failed to connect to proxy" error="websocket: bad handshake"
Feb 03 21:47:05 capital k3s[6465]: time="2022-02-03T21:47:05+01:00" level=error msg="Remotedialer proxy error" error="websocket: bad handshake"

@withinboredom (Contributor, Author)

Just went through the whole thing again, here's where things get interesting:

Feb 03 21:58:41 capital k3s[6596]: {"level":"info","ts":"2022-02-03T21:58:41.949+0100","caller":"rafthttp/http.go:257","msg":"receiving database snapshot","local-member-id":"64f8579255e91227","remote-snapshot-sender-id":"689da5af1f9c0825","incoming-snapshot-index":11225279,"incoming-snapshot-message-size-bytes":81929,"incoming-snapshot-message-size":"82 kB"}
Feb 03 21:58:41 capital k3s[6596]: {"level":"info","ts":"2022-02-03T21:58:41.980+0100","caller":"rafthttp/stream.go:249","msg":"set message encoder","from":"64f8579255e91227","to":"e272d568cc232d52","stream-type":"stream Message"}
Feb 03 21:58:41 capital k3s[6596]: {"level":"info","ts":"2022-02-03T21:58:41.980+0100","caller":"rafthttp/stream.go:274","msg":"established TCP streaming connection with remote peer","stream-writer-type":"stream Message","local-member-id":"64f8579255e91227","remote-peer-id":"e272d568cc232d52"}
Feb 03 21:58:41 capital k3s[6596]: {"level":"info","ts":"2022-02-03T21:58:41.981+0100","caller":"rafthttp/stream.go:249","msg":"set message encoder","from":"64f8579255e91227","to":"e272d568cc232d52","stream-type":"stream MsgApp v2"}
Feb 03 21:58:41 capital k3s[6596]: {"level":"info","ts":"2022-02-03T21:58:41.981+0100","caller":"rafthttp/stream.go:274","msg":"established TCP streaming connection with remote peer","stream-writer-type":"stream MsgApp v2","local-member-id":"64f8579255e91227","remote-peer-id":"e272d568cc232d52"}
Feb 03 21:58:41 capital k3s[6596]: {"level":"info","ts":"2022-02-03T21:58:41.982+0100","caller":"rafthttp/stream.go:249","msg":"set message encoder","from":"64f8579255e91227","to":"689da5af1f9c0825","stream-type":"stream Message"}
Feb 03 21:58:41 capital k3s[6596]: {"level":"info","ts":"2022-02-03T21:58:41.982+0100","caller":"rafthttp/stream.go:274","msg":"established TCP streaming connection with remote peer","stream-writer-type":"stream Message","local-member-id":"64f8579255e91227","remote-peer-id":"689da5af1f9c0825"}
Feb 03 21:58:41 capital k3s[6596]: {"level":"info","ts":"2022-02-03T21:58:41.982+0100","caller":"rafthttp/stream.go:249","msg":"set message encoder","from":"64f8579255e91227","to":"689da5af1f9c0825","stream-type":"stream MsgApp v2"}
Feb 03 21:58:41 capital k3s[6596]: {"level":"info","ts":"2022-02-03T21:58:41.982+0100","caller":"rafthttp/stream.go:274","msg":"established TCP streaming connection with remote peer","stream-writer-type":"stream MsgApp v2","local-member-id":"64f8579255e91227","remote-peer-id":"689da5af1f9c0825"}
Feb 03 21:58:41 capital k3s[6596]: {"level":"info","ts":"2022-02-03T21:58:41.986+0100","caller":"etcdserver/server.go:767","msg":"initialized peer connections; fast-forwarding election ticks","local-member-id":"64f8579255e91227","forward-ticks":8,"forward-duration":"4s","election-ticks":10,"election-timeout":"5s","active-remote-members":2}
Feb 03 21:58:46 capital k3s[6596]: time="2022-02-03T21:58:46+01:00" level=info msg="Connecting to proxy" url="wss://127.0.0.1:6443/v1-k3s/connect"
Feb 03 21:58:46 capital k3s[6596]: time="2022-02-03T21:58:46+01:00" level=error msg="Failed to connect to proxy" error="websocket: bad handshake"
Feb 03 21:58:46 capital k3s[6596]: time="2022-02-03T21:58:46+01:00" level=error msg="Remotedialer proxy error" error="websocket: bad handshake"
Feb 03 21:58:46 capital k3s[6596]: {"level":"warn","ts":"2022-02-03T21:58:46.964+0100","caller":"rafthttp/http.go:271","msg":"failed to save incoming database snapshot","local-member-id":"64f8579255e91227","remote-snapshot-sender-id":"689da5af1f9c0825","incoming-snapshot-index":11225279,"error":"read tcp 192.168.100.3:2380->192.168.100.2:25862: i/o timeout"}
Feb 03 21:58:50 capital k3s[6596]: {"level":"warn","ts":"2022-02-03T21:58:50.823+0100","logger":"etcd-client","caller":"v3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0027181c0/127.0.0.1:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: last connection error: connection error: desc = \"transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused\""}
Feb 03 21:58:50 capital k3s[6596]: time="2022-02-03T21:58:50+01:00" level=info msg="Failed to test data store connection: context deadline exceeded"
Feb 03 21:58:51 capital k3s[6596]: time="2022-02-03T21:58:51+01:00" level=info msg="Connecting to proxy" url="wss://127.0.0.1:6443/v1-k3s/connect"
Feb 03 21:58:51 capital k3s[6596]: time="2022-02-03T21:58:51+01:00" level=error msg="Failed to connect to proxy" error="websocket: bad handshake"
Feb 03 21:58:51 capital k3s[6596]: time="2022-02-03T21:58:51+01:00" level=error msg="Remotedialer proxy error" error="websocket: bad handshake"
Feb 03 21:58:56 capital k3s[6596]: time="2022-02-03T21:58:56+01:00" level=info msg="Connecting to proxy" url="wss://127.0.0.1:6443/v1-k3s/connect"
Feb 03 21:58:56 capital k3s[6596]: time="2022-02-03T21:58:56+01:00" level=error msg="Failed to connect to proxy" error="websocket: bad handshake"
Feb 03 21:58:56 capital k3s[6596]: time="2022-02-03T21:58:56+01:00" level=error msg="Remotedialer proxy error" error="websocket: bad handshake"
Feb 03 21:58:56 capital k3s[6596]: {"level":"warn","ts":"2022-02-03T21:58:56.936+0100","caller":"etcdserver/server.go:2050","msg":"failed to publish local member to cluster through raft","local-member-id":"64f8579255e91227","local-member-attributes":"{Name:capital-bb983397 ClientURLs:[https://192.168.100.3:2379]}","request-path":"/0/members/64f8579255e91227/attributes","publish-timeout":"15s","error":"etcdserver: request timed out, possibly due to connection lost"}
Feb 03 21:59:01 capital k3s[6596]: time="2022-02-03T21:59:01+01:00" level=info msg="Connecting to proxy" url="wss://127.0.0.1:6443/v1-k3s/connect"
Feb 03 21:59:01 capital k3s[6596]: time="2022-02-03T21:59:01+01:00" level=error msg="Failed to connect to proxy" error="websocket: bad handshake"
Feb 03 21:59:01 capital k3s[6596]: time="2022-02-03T21:59:01+01:00" level=error msg="Remotedialer proxy error" error="websocket: bad handshake"
Feb 03 21:59:05 capital k3s[6596]: {"level":"warn","ts":"2022-02-03T21:59:05.824+0100","logger":"etcd-client","caller":"v3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0027181c0/127.0.0.1:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
Feb 03 21:59:05 capital k3s[6596]: time="2022-02-03T21:59:05+01:00" level=info msg="Failed to test data store connection: context deadline exceeded"
Feb 03 21:59:06 capital k3s[6596]: {"level":"warn","ts":"2022-02-03T21:59:06.864+0100","logger":"etcd-client","caller":"v3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0027181c0/127.0.0.1:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
Feb 03 21:59:06 capital k3s[6596]: time="2022-02-03T21:59:06+01:00" level=error msg="Failed to check local etcd status for learner management: context deadline exceeded"
Feb 03 21:59:06 capital k3s[6596]: time="2022-02-03T21:59:06+01:00" level=info msg="Connecting to proxy" url="wss://127.0.0.1:6443/v1-k3s/connect"
Feb 03 21:59:06 capital k3s[6596]: time="2022-02-03T21:59:06+01:00" level=error msg="Failed to connect to proxy" error="websocket: bad handshake"
Feb 03 21:59:06 capital k3s[6596]: time="2022-02-03T21:59:06+01:00" level=error msg="Remotedialer proxy error" error="websocket: bad handshake"

@withinboredom (Contributor, Author)

From one of the other nodes, I think this is the main issue?

r: code = DeadlineExceeded desc = context deadline exceeded"}
Feb 03 22:37:18 donkey k3s[1229182]: time="2022-02-03T22:37:18+01:00" level=warning msg="Learner capital-25aa429b stalled at RaftAppliedIndex=0 for 1m15.973650914s"
Feb 03 22:37:22 donkey k3s[1229182]: {"level":"warn","ts":"2022-02-03T22:37:22.726+0100","caller":"backend/backend.go:344","msg":"snapshotting taking too long to transfer","taking":"1m30.000285888s","bytes":77922304,"size":"78 MB"}

It seems that etcd may need longer to transfer bigger snapshots, which may explain why ~3 weeks in my environment is "long enough."
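
For the record, a rough way to check how large the embedded etcd database has grown (and therefore how big that snapshot transfer will be) is to query etcd directly; the TLS paths below are the usual k3s defaults and may differ on other setups, and etcdctl is not bundled with k3s, so it has to be installed separately:

# show member status, including DB size, for the local embedded etcd
ETCDCTL_API=3 etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/var/lib/rancher/k3s/server/tls/etcd/server-ca.crt \
  --cert=/var/lib/rancher/k3s/server/tls/etcd/server-client.crt \
  --key=/var/lib/rancher/k3s/server/tls/etcd/server-client.key \
  endpoint status --write-out=table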

@withinboredom (Contributor, Author)

@brandond this smells suspiciously like a timeout somewhere. Is there any way that you know of to configure it?

@withinboredom (Contributor, Author)

Aha, there are no snapshots in the snapshot directory for the leader 🤔 so it has nothing to transfer. The original node that (re)bootstrapped the cluster, however, does have snapshots...
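
For context, the directory in question is presumably the embedded etcd member's snap directory under the k3s data dir (the path below assumes a default install):

# list raft snapshots held by the local embedded etcd member
ls -lh /var/lib/rancher/k3s/server/db/etcd/member/snap/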

@brandond (Contributor) commented Feb 3, 2022

This is all internal etcd stuff that we don't manage too closely. I believe etcd takes an internal snapshot to bring the newly joined member up to speed... that said, I've never seen an I/O timeout while the member is streaming the initial snapshot from the peer:

Feb 03 21:58:41 capital k3s[6596]: {"level":"info","ts":"2022-02-03T21:58:41.982+0100","caller":"rafthttp/stream.go:274","msg":"established TCP streaming connection with remote peer","stream-writer-type":"stream MsgApp v2","local-member-id":"64f8579255e91227","remote-peer-id":"689da5af1f9c0825"}
Feb 03 21:58:41 capital k3s[6596]: {"level":"info","ts":"2022-02-03T21:58:41.986+0100","caller":"etcdserver/server.go:767","msg":"initialized peer connections; fast-forwarding election ticks","local-member-id":"64f8579255e91227","forward-ticks":8,"forward-duration":"4s","election-ticks":10,"election-timeout":"5s","active-remote-members":2}
Feb 03 21:58:46 capital k3s[6596]: {"level":"warn","ts":"2022-02-03T21:58:46.964+0100","caller":"rafthttp/http.go:271","msg":"failed to save incoming database snapshot","local-member-id":"64f8579255e91227","remote-snapshot-sender-id":"689da5af1f9c0825","incoming-snapshot-index":11225279,"error":"read tcp 192.168.100.3:2380->192.168.100.2:25862: i/o timeout"}

Can you attach logs from all three nodes, covering the time period where you're restarting the node that fails to come back up? Is there anything unique about your configuration? Odd network topology, slow disks, unusual k3s cli flags?

@withinboredom (Contributor, Author)

I'm able to reproduce it roughly one out of three times, so I can most likely get the first and second node back into the cluster, but so far I haven't been lucky enough to get all three.

Can you attach logs from all three nodes, covering the time period where you're restarting the node that fails to come back up?

I'll get you some logs tomorrow; with all the failing services trying to anti-affinity themselves, it's pretty noisy. I'll see if I can get them scaled to 0 for some nice clean logs.

Is there anything unique about your configuration? Odd network topology, slow disks, unusual k3s cli flags?

The only thing I can think of is that it's bare-metal with a VLAN between the nodes. Here are the k3s flags:

--no-deploy traefik --no-deploy local-storage --node-name $NODE --node-label topology.kubernetes.io/region=west-eu --node-external-ip=XX.XX.XX.XX --advertise-address=$VLAN_ADDR --node-ip=$VLAN_ADDR --flannel-iface enp9s0.4000

@withinboredom (Contributor, Author) commented Feb 4, 2022

Both logs can be found in this gist.

"master" node was started with:

 server \
        'server' \
        '--no-deploy' \
        'traefik' \
        '--no-deploy' \
        'local-storage' \
        '--node-name' \
        'delightful' \
        '--node-label' \
        'topology.kubernetes.io/region=west-eu' \
        '--node-label' \
        'topology.kubernetes.io/zone=HEL1-DC6' \
        '--node-external-ip=X.X.X.X' \
        '--advertise-address=192.168.100.2' \
        '--node-ip=192.168.100.2' \
        '--flannel-iface' \
        'enp9s0.4000' \
        '--cluster-init' \

Second node was started with:

        'server' \
        '--no-deploy' \
        'traefik' \
        '--no-deploy' \
        'local-storage' \
        '--node-name' \
        'capital' \
        '--node-label' \
        'topology.kubernetes.io/region=west-eu' \
        '--node-label' \
        'topology.kubernetes.io/zone=HEL1-DC6' \
        '--node-external-ip=X.X.X.X' \
        '--advertise-address=192.168.100.3' \
        '--node-ip=192.168.100.3' \
        '--flannel-iface' \
        'enp8s0.4000' \
        '--server' \
        'https://192.168.100.2:6443' \

Looking at the logs, it seems like there is (maybe) an etcd bug?

"msg":"sending merged snapshot","from":"689da5af1f9c0825","to":"ee4a2ddc00ac1817","bytes":78041332,"size":"78 MB"

and then on the receiving end:

"incoming-snapshot-message-size-bytes":119028,"incoming-snapshot-message-size":"119 kB"

Those sizes are vastly different...

@brandond (Contributor) commented Feb 4, 2022

That does appear to be a bug in etcd; one node reports expecting a much smaller snapshot than the other node is prepared to send. Are these nodes both the same architecture? I can't imagine that would be it, but there must be something unique about your environment since I haven't run into this anywhere else and can't reproduce it on my end.

@withinboredom (Contributor, Author)

Yeah, they're the same architecture; in fact they're almost exactly the same, with the only difference being the brand of network card. I've tried doing a manual restore using etcdctl and the disaster recovery instructions. It ended up taking a slightly different path to the same result.
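
The manual restore mentioned above is roughly etcd's standard disaster-recovery flow (paths and names below are placeholders); k3s also has its own --cluster-reset-restore-path flag for snapshots it took itself:

# restore an etcd snapshot into a fresh data directory, then start the member from it
ETCDCTL_API=3 etcdctl snapshot restore <snapshot-file> \
  --name <member-name> \
  --initial-cluster <member-name>=https://<node-ip>:2380 \
  --initial-advertise-peer-urls https://<node-ip>:2380 \
  --data-dir <new-data-dir>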

Anyway, I did discover this in the logs when the server comes online:

"msg":"recovered v2 store from snapshot","snapshot-index":11300115,"snapshot-size":"117 kB"

So, maybe there is some v2 data in the snapshot which is causing an issue, and maybe that's also why you can't replicate it? It doesn't explain why I see this every three weeks... but maybe every few weeks I just run into a different problem :)

@withinboredom (Contributor, Author)

Hmm, there could also be some interesting bugs in how etcd calculates sizes, since it up-casts to uint64, does operations, and (I don't know golang too well) then casts back to an int, which I assume could result in an overflow or other shenanigans under specific conditions. Granted, the limit is hard-coded at 512 MB, which should never reach the edge of an int...

Anyway, I don't know what to do except convert to a single node cluster with two workers OR rebuild the cluster from scratch.

@withinboredom (Contributor, Author)

Wireshark to the rescue.

I noticed a number of retransmissions and other shenanigans on the link. So, I switched the cluster to use the external IP address and everything started working correctly. I don't think this is directly a k3s or etcd bug, but maybe it could be more resilient.

Thanks for helping me @brandond!
