After a complete power off: the server is currently unable to handle the request #2249
Comments
Nobody has this issue?
Looks like we're the beta testers @PPCM ;) I have the same issue with v1.19.1+k3s1.
After a reboot, and even after waiting 15+ minutes:
Output of master-a logs: k3s-a.log
Thanks for the feedback, y'all! We'll investigate this as soon as possible. We expected there might be a few corner cases with the embedded etcd cluster, which is why it's still experimental, but it should stabilize soon.
I redeployed the master nodes, this time with 192.168.42.99, which is my keepalived VIP. It has the following configuration:
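(The configuration snippet itself didn't survive in this thread. Purely as a hypothetical illustration of the setup being described, a minimal keepalived VRRP config for a VIP like 192.168.42.99 might look like this; the interface, router ID, and priority are placeholders.)

```sh
# Hypothetical sketch only, not the reporter's actual config.
sudo tee /etc/keepalived/keepalived.conf <<'EOF'
vrrp_instance k3s_vip {
    state MASTER           # BACKUP on the other two masters
    interface eth0         # placeholder interface name
    virtual_router_id 42   # placeholder; must match on all nodes
    priority 100           # lower on BACKUP nodes
    advert_int 1
    virtual_ipaddress {
        192.168.42.99/24
    }
}
EOF
```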
I also noticed that when I reboot only the first master node (a), once it comes back up and the k3s server has started, I get the same issue.
Does this work without the keepalived VIP? Also, Pis have some odd issues with systemd starting things before the network is actually up. Do you see any change in behavior if you modify your k3s systemd unit as described here: https://github.com/rancher/k3s/pull/2210/files
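(A hedged sketch of that kind of change as a systemd drop-in — the drop-in filename here is my own choice; the linked PR itself edits the unit shipped by the install script.)

```sh
# Make k3s wait until the network is actually online, not merely configured.
sudo mkdir -p /etc/systemd/system/k3s.service.d
sudo tee /etc/systemd/system/k3s.service.d/10-network-online.conf <<'EOF'
[Unit]
Wants=network-online.target
After=network-online.target
EOF
sudo systemctl daemon-reload
```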
I will report back on removing keepalived and whether I experience the same issue. No RasPis here, I am using Intel NUCs; also, my systemd unit already had that change applied.
@onedr0p if you're not on the same platform as this issue can you open a new issue and fill out the template? It's hard to track affected system configurations when people me-too on other issues without providing all the info.
Removing the keepalived VIP didn't work, meaning the master nodes still do not come back online after a reboot. @brandond Apologies for not including my hardware specs initially, but these issues look identical, no? I would expect a fix for one architecture to be a fix for the other.
That depends on what the root cause is. Different OS distribution and different hardware merit a new issue.
I made a test with the new RC.
The result is the same: the cluster is still down, but the logs are different:
Yes, it appears to be deadlocked. Node 1 is waiting for nodes 2 and 3 to come up before etcd will start, since a single node does not have quorum (an etcd cluster of N members needs floor(N/2)+1 of them up, so a 3-member cluster needs 2). Nodes 2 and 3 are waiting for node 1 to come up so that they can bootstrap the certs, and can't proceed past that to start etcd. I'm not sure why nodes 2 and 3 are trying to load certs again, though, as this should have been completed when they came up the first time.
This is definitely a blocker for GAing etcd support. We'll figure it out.
Cc @davidnuzik
On the primary server, this happens because we block starting the http listener until the datastore finishes starting. Obviously the datastore can't come up until quorum is reached, which requires additional servers. On the secondary servers, this happens because we try to validate the client token here:
The workaround is to remove K3S_TOKEN and K3S_URL from /etc/systemd/system/k3s.service.env on the secondary servers after the first startup, so they don't try to bootstrap again. The real fix is to either correct the check order, or start the http listener earlier so that it can respond to bootstrap requests before etcd is up. The first option is probably simpler, assuming we can handle any side effects of switching the checks.
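(The same workaround as shell commands — the env file path comes straight from the comment above; the sed invocation and service restart are my additions.)

```sh
# On each secondary server, after its first successful startup, drop the
# bootstrap variables so k3s doesn't try to bootstrap against server 1 again.
sudo sed -i '/^K3S_TOKEN=/d; /^K3S_URL=/d' /etc/systemd/system/k3s.service.env
sudo systemctl restart k3s
```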
I tried the workaround and it works fine. Thanks!
We're all so eager to get this to work :D @brandond I am using a different method of install (https://github.com/PyratLabs/ansible-role-k3s), which includes the flags in the systemd files. I am still unable to make the cluster come back after a reboot. Maybe you can see why by looking at my systemd files?
ansible all -i inventory/custom/hosts.yml -b -m shell -a "cat /etc/systemd/system/k3s.service"
k8s-master-a | CHANGED | rc=0 >>
[Unit]
Description=Lightweight Kubernetes
Documentation=https://k3s.io
After=network.target
[Service]
Type=notify
ExecStartPre=-/sbin/modprobe br_netfilter
ExecStartPre=-/sbin/modprobe overlay
ExecStart=/usr/local/bin/k3s server --flannel-backend none --cluster-cidr 10.42.0.0/16 --service-cidr 10.43.0.0/16 --disable servicelb --disable traefik --disable local-storage --disable metrics-server --cluster-init --tls-san 192.168.42.19 --node-ip 192.168.42.20 --kubelet-arg feature-gates=ExternalPolicyForExternalIP=true
KillMode=process
Delegate=yes
LimitNOFILE=infinity
LimitNPROC=infinity
LimitCORE=infinity
TasksMax=infinity
TimeoutStartSec=0
Restart=always
RestartSec=5s
[Install]
WantedBy=multi-user.target
k8s-master-c | CHANGED | rc=0 >>
[Unit]
Description=Lightweight Kubernetes
Documentation=https://k3s.io
After=network.target
[Service]
Type=notify
ExecStartPre=-/sbin/modprobe br_netfilter
ExecStartPre=-/sbin/modprobe overlay
ExecStart=/usr/local/bin/k3s server --flannel-backend none --cluster-cidr 10.42.0.0/16 --service-cidr 10.43.0.0/16 --disable servicelb --disable traefik --disable local-storage --disable metrics-server --server https://192.168.42.19:6443 --token K10a4e55644c307802de8b9b60d40a902e4e72dd3204994feeb481e63e5823ed4ea::server:12006d0e29e047265baf3c1fc06c6ec9 --tls-san 192.168.42.19 --node-ip 192.168.42.22 --kubelet-arg feature-gates=ExternalPolicyForExternalIP=true
KillMode=process
Delegate=yes
LimitNOFILE=infinity
LimitNPROC=infinity
LimitCORE=infinity
TasksMax=infinity
TimeoutStartSec=0
Restart=always
RestartSec=5s
[Install]
WantedBy=multi-user.target
k8s-master-b | CHANGED | rc=0 >>
[Unit]
Description=Lightweight Kubernetes
Documentation=https://k3s.io
After=network.target
[Service]
Type=notify
ExecStartPre=-/sbin/modprobe br_netfilter
ExecStartPre=-/sbin/modprobe overlay
ExecStart=/usr/local/bin/k3s server --flannel-backend none --cluster-cidr 10.42.0.0/16 --service-cidr 10.43.0.0/16 --disable servicelb --disable traefik --disable local-storage --disable metrics-server --server https://192.168.42.19:6443 --token K10a4e55644c307802de8b9b60d40a902e4e72dd3204994feeb481e63e5823ed4ea::server:12006d0e29e047265baf3c1fc06c6ec9 --tls-san 192.168.42.19 --node-ip 192.168.42.21 --kubelet-arg feature-gates=ExternalPolicyForExternalIP=true
KillMode=process
Delegate=yes
LimitNOFILE=infinity
LimitNPROC=infinity
LimitCORE=infinity
TasksMax=infinity
TimeoutStartSec=0
Restart=always
RestartSec=5s
[Install]
WantedBy=multi-user.target
Also, random question: if we label a worker node as etcd, will it also be added to the quorum? Or is that specifically for master nodes?
@onedr0p in that case you will need to remove the --server and --token flags from the server command line. Only servers (nodes started with the k3s server subcommand) run the embedded etcd and can participate in the quorum; agents do not.
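(Illustrative only: the subcommands below are real k3s commands, but the URL and token are placeholders, not values from this thread.)

```sh
# Joining as a server: runs the embedded etcd and counts toward quorum.
k3s server --server https://<existing-server>:6443 --token <token>

# Joining as an agent (worker): never runs etcd, so labeling it "etcd"
# will not add it to the quorum.
k3s agent --server https://<existing-server>:6443 --token <token>
```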
@brandond I removed the --server and --token flags. Systemd service:
Here are the logs for that node after reboot:
At least it's not complaining about certificates now :)
Edit: I may try a manual install to see if this can be replicated. However, reviewing the ansible role (PyratLabs/ansible-role-k3s), I do not see anything that stands out as to why it would not work.
@onedr0p on k8s-master-c, can you change your command from …?
Sure, here you go; the systemd file is the same but with …
Edit 1: it appears that after I run …
Edit 2: it appears that, in my case, my systemd file was … Sorry for missing that 🤦 The node now comes back healthy after removing the --server and --token flags.
Reproduced the issue using v1.19.1-rc1+k3s1 and validated the fix using commit ID 6b11d86. Created a cluster with 3 server nodes. Rebooted all nodes. All nodes were able to reconnect.
Servers 2 and 3:
After reboot:
Is there a reason why this fix (the commit ID from @ShylajaDevadiga's comment, which indeed solves this bug) is not included in the latest v1.19.3+k3s1 release?
Not everything makes the planned milestone. The upstream release had several CVEs, so we wanted to turn it around quickly. We've got another release planned soon (independent of upstream patches) that will include backported fixes from the master branch. |
We should ignore --token and --server if the managed database is initialized, just like we ignore --cluster-init. If the user wants to join a new cluster, or rejoin a cluster after --cluster-reset, they need to delete the database. This is a cleaner way to prevent deadlocking on quorum loss, and removes the requirement that the target of the --server argument must be online before already-joined nodes can start. Signed-off-by: Brad Davidson <brad.davidson@rancher.com>
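(A hedged sketch of the behavior this commit describes, expressed as shell rather than the actual Go implementation; the data directory shown is the default location of k3s's embedded etcd.)

```sh
# If the managed etcd datastore already exists on disk, this node has
# already joined a cluster: ignore --server/--token (like --cluster-init)
# and start etcd from the existing local state instead of re-bootstrapping.
if [ -d /var/lib/rancher/k3s/server/db/etcd ]; then
    echo "managed etcd already initialized; ignoring --server and --token"
fi
```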
Hi @brandond, I am using this setup (Exercise I Performed). PS: I am using k3OS on Node2 and Node3, and I am facing this issue: after a reboot the setup does not come up. Can you suggest anything in particular for this case?
@dharmendrakariya this issue was fixed in the PR linked above. Please confirm that you are running a release that includes the fix.
@brandond okay, great, but I think k3OS doesn't have this patch, because I hit this just today.
Ah, in that case you should probably open an issue in the k3os repo so that it can be tracked there.
Oops, my bad, will do that. You are right!
Environmental Info:
K3s Version: k3s version v1.19.1-rc1+k3s1 (041f18f)
Node(s) CPU architecture, OS, and Version:
Linux cluster01 5.4.0-1018-raspi #20-Ubuntu SMP Sun Sep 6 05:11:16 UTC 2020 aarch64 aarch64 aarch64 GNU/Linux
Cluster Configuration:
3 masters
Describe the bug:
After a fresh install, all nodes were up correctly. After a complete power off of all nodes, when the devices come back up, the nodes are not able to reconnect.
Steps To Reproduce:
Install K3s on the first master node.
On the 2 other masters, install K3s with the correct first-node URL and token (a hedged sketch of the typical commands follows).
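(The exact commands were not captured in the issue; this sketch assumes the standard get.k3s.io install script, and the IP and token are placeholders.)

```sh
# First master: initialize a new cluster with embedded etcd.
curl -sfL https://get.k3s.io | INSTALL_K3S_VERSION="v1.19.1-rc1+k3s1" sh -s - server --cluster-init

# Two other masters: join using the first node's URL and its node token
# (found at /var/lib/rancher/k3s/server/node-token on the first master).
curl -sfL https://get.k3s.io | INSTALL_K3S_VERSION="v1.19.1-rc1+k3s1" sh -s - server \
  --server https://<first-master-ip>:6443 \
  --token <node-token>
```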
Before the shutdown, all nodes are Ready
After the power off, here is the error on the first-created master node:
And on the other master nodes:
Expected behavior:
After a shutdown, all nodes should reconnect; at the very least, one of them should start correctly.
Actual behavior:
Connection errors to the nodes.
Additional context / logs:
On the first master, these messages repeat cyclically:
On the other masters, these cyclic messages appear: