After a complete power off: the server is currently unable to handle the request #2249
Comments
Nobody has this issue?
Looks like we're the beta testers @PPCM ;) I have the same issue with v1.19.1+k3s1.
After a reboot, and even after waiting 15+ minutes:
Output of master-a logs: k3s-a.log
Thanks for the feedback, y'all! We'll investigate this as soon as possible. We expected there might be a few corner cases with the embedded etcd cluster, which is why it's still experimental, but it should stabilize soon.
I redeployed the master nodes, this time with 192.168.42.99, which is my keepalived VIP. It has the following configuration:
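(The configuration snippet itself didn't survive in this thread. Purely as a hypothetical illustration of the setup being described, a minimal keepalived VRRP config for a VIP like 192.168.42.99 might look like this; the interface, router ID, and priority are placeholders.)

```sh
# Hypothetical sketch only, not the reporter's actual config.
sudo tee /etc/keepalived/keepalived.conf <<'EOF'
vrrp_instance k3s_vip {
    state MASTER           # BACKUP on the other two masters
    interface eth0         # placeholder interface name
    virtual_router_id 42   # placeholder; must match on all nodes
    priority 100           # lower on BACKUP nodes
    advert_int 1
    virtual_ipaddress {
        192.168.42.99/24
    }
}
EOF
```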
I also noticed that when I reboot only the first master node (a), once it comes back up and the k3s server has started, I get the same issue.
Does this work without the keepalived VIP? Also, Pis have some odd issues with systemd starting things before the network is actually up. Do you see any change in behavior if you modify your k3s systemd unit as described here: https://github.com/rancher/k3s/pull/2210/files
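(A hedged sketch of that kind of change as a systemd drop-in — the drop-in filename here is my own choice; the linked PR itself edits the unit shipped by the install script.)

```sh
# Make k3s wait until the network is actually online, not merely configured.
sudo mkdir -p /etc/systemd/system/k3s.service.d
sudo tee /etc/systemd/system/k3s.service.d/10-network-online.conf <<'EOF'
[Unit]
Wants=network-online.target
After=network-online.target
EOF
sudo systemctl daemon-reload
```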
I will report back on removing keepalived and whether I experience the same issue. No RasPis here, I am using Intel NUCs; also, my systemd unit already had that change applied.
@onedr0p if you're not on the same platform as this issue can you open a new issue and fill out the template? It's hard to track affected system configurations when people me-too on other issues without providing all the info.
Removing the keepalived VIP didn't work, meaning the master nodes still do not come back online after a reboot. @brandond Apologies for not including my hardware specs initially, but these issues look identical, no? I would expect a fix for one architecture to be a fix for the other.
That depends on what the root cause is. Different OS distribution and different hardware merit a new issue.
I made a test with the new RC.
The result is the same: the cluster is still down, but the logs are different:
Yes, it appears to be deadlocked. Node 1 is waiting for nodes 2 and 3 to come up before etcd will start, since a single node does not have quorum (an etcd cluster of N members needs floor(N/2)+1 of them up, so a 3-member cluster needs 2). Nodes 2 and 3 are waiting for node 1 to come up so that they can bootstrap the certs, and can't proceed past that to start etcd. I'm not sure why nodes 2 and 3 are trying to load certs again, though, as this should have been completed when they came up the first time.
This is definitely a blocker for GAing etcd support. We'll figure it out.
Cc @davidnuzik
On the primary server, this happens because we block starting the http listener until the datastore finishes starting. Obviously the datastore can't come up until quorum is reached, which requires additional servers. On the secondary servers, this happens because we try to validate the client token here:
The workaround is to remove K3S_TOKEN and K3S_URL from /etc/systemd/system/k3s.service.env on the secondary servers after the first startup, so they don't try to bootstrap again. The real fix is to either correct the check order, or start the http listener earlier so that it can respond to bootstrap requests before etcd is up. The first option is probably simpler, assuming we can handle any side effects of switching the checks.
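(The same workaround as shell commands — the env file path comes straight from the comment above; the sed invocation and service restart are my additions.)

```sh
# On each secondary server, after its first successful startup, drop the
# bootstrap variables so k3s doesn't try to bootstrap against server 1 again.
sudo sed -i '/^K3S_TOKEN=/d; /^K3S_URL=/d' /etc/systemd/system/k3s.service.env
sudo systemctl restart k3s
```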
I tried the workaround and it works fine. Thanks!
We're all so eager to get this to work :D @brandond I am using a different method of install (https://github.com/PyratLabs/ansible-role-k3s), which includes the flags in the systemd files. I am still unable to make the cluster come back after a reboot. Maybe you can see why by looking at my systemd files?
ansible all -i inventory/custom/hosts.yml -b -m shell -a "cat /etc/systemd/system/k3s.service"
k8s-master-a | CHANGED | rc=0 >>
[Unit]
Description=Lightweight Kubernetes
Documentation=https://k3s.io
After=network.target
[Service]
Type=notify
ExecStartPre=-/sbin/modprobe br_netfilter
ExecStartPre=-/sbin/modprobe overlay
ExecStart=/usr/local/bin/k3s server --flannel-backend none --cluster-cidr 10.42.0.0/16 --service-cidr 10.43.0.0/16 --disable servicelb --disable traefik --disable local-storage --disable metrics-server --cluster-init --tls-san 192.168.42.19 --node-ip 192.168.42.20 --kubelet-arg feature-gates=ExternalPolicyForExternalIP=true
KillMode=process
Delegate=yes
LimitNOFILE=infinity
LimitNPROC=infinity
LimitCORE=infinity
TasksMax=infinity
TimeoutStartSec=0
Restart=always
RestartSec=5s
[Install]
WantedBy=multi-user.target
k8s-master-c | CHANGED | rc=0 >>
[Unit]
Description=Lightweight Kubernetes
Documentation=https://k3s.io
After=network.target
[Service]
Type=notify
ExecStartPre=-/sbin/modprobe br_netfilter
ExecStartPre=-/sbin/modprobe overlay
ExecStart=/usr/local/bin/k3s server --flannel-backend none --cluster-cidr 10.42.0.0/16 --service-cidr 10.43.0.0/16 --disable servicelb --disable traefik --disable local-storage --disable metrics-server --server https://192.168.42.19:6443 --token K10a4e55644c307802de8b9b60d40a902e4e72dd3204994feeb481e63e5823ed4ea::server:12006d0e29e047265baf3c1fc06c6ec9 --tls-san 192.168.42.19 --node-ip 192.168.42.22 --kubelet-arg feature-gates=ExternalPolicyForExternalIP=true
KillMode=process
Delegate=yes
LimitNOFILE=infinity
LimitNPROC=infinity
LimitCORE=infinity
TasksMax=infinity
TimeoutStartSec=0
Restart=always
RestartSec=5s
[Install]
WantedBy=multi-user.target
k8s-master-b | CHANGED | rc=0 >>
[Unit]
Description=Lightweight Kubernetes
Documentation=https://k3s.io
After=network.target
[Service]
Type=notify
ExecStartPre=-/sbin/modprobe br_netfilter
ExecStartPre=-/sbin/modprobe overlay
ExecStart=/usr/local/bin/k3s server --flannel-backend none --cluster-cidr 10.42.0.0/16 --service-cidr 10.43.0.0/16 --disable servicelb --disable traefik --disable local-storage --disable metrics-server --server https://192.168.42.19:6443 --token K10a4e55644c307802de8b9b60d40a902e4e72dd3204994feeb481e63e5823ed4ea::server:12006d0e29e047265baf3c1fc06c6ec9 --tls-san 192.168.42.19 --node-ip 192.168.42.21 --kubelet-arg feature-gates=ExternalPolicyForExternalIP=true
KillMode=process
Delegate=yes
LimitNOFILE=infinity
LimitNPROC=infinity
LimitCORE=infinity
TasksMax=infinity
TimeoutStartSec=0
Restart=always
RestartSec=5s
[Install]
WantedBy=multi-user.target
Also, random question: if we label a worker node as etcd, will it also be added to the quorum? Or is that specifically for master nodes?
@onedr0p in that case you will need to remove the --server and --token flags from the server command line. Only servers (nodes started with the k3s server subcommand) run the embedded etcd and can participate in the quorum; agents do not.
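(Illustrative only: the subcommands below are real k3s commands, but the URL and token are placeholders, not values from this thread.)

```sh
# Joining as a server: runs the embedded etcd and counts toward quorum.
k3s server --server https://<existing-server>:6443 --token <token>

# Joining as an agent (worker): never runs etcd, so labeling it "etcd"
# will not add it to the quorum.
k3s agent --server https://<existing-server>:6443 --token <token>
```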
@brandond I removed the --server and --token flags. Systemd service:
Here are the logs for that node after reboot:
At least it's not complaining about certificates now :)
Edit: I may try a manual install to see if this can be replicated. However, reviewing the ansible role (PyratLabs/ansible-role-k3s), I do not see anything that stands out as to why it would not work.
@onedr0p on k8s-master-c, can you change your command from …?
Sure, here you go; the systemd file is the same but with …
Edit 1: it appears that after I run …
Edit 2: it appears that, in my case, my systemd file was … Sorry for missing that 🤦 The node now comes back healthy after removing the --server and --token flags.
Reproduced the issue using v1.19.1-rc1+k3s1 and validated the fix using commit ID 6b11d86. Created a cluster with 3 server nodes. Rebooted all nodes. All nodes were able to reconnect.
Servers 2 and 3:
After reboot:
Is there a reason why this fix (the commit ID from @ShylajaDevadiga's comment, which indeed solves this bug) is not included in the latest v1.19.3+k3s1 release?
Not everything makes the planned milestone. The upstream release had several CVEs, so we wanted to turn it around quickly. We've got another release planned soon (independent of upstream patches) that will include backported fixes from the master branch. |
We should ignore --token and --server if the managed database is initialized, just like we ignore --cluster-init. If the user wants to join a new cluster, or rejoin a cluster after --cluster-reset, they need to delete the database. This is a cleaner way to prevent deadlocking on quorum loss, and removes the requirement that the target of the --server argument must be online before already-joined nodes can start. Signed-off-by: Brad Davidson <brad.davidson@rancher.com>
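(A hedged sketch of the behavior this commit describes, expressed as shell rather than the actual Go implementation; the data directory shown is the default location of k3s's embedded etcd.)

```sh
# If the managed etcd datastore already exists on disk, this node has
# already joined a cluster: ignore --server/--token (like --cluster-init)
# and start etcd from the existing local state instead of re-bootstrapping.
if [ -d /var/lib/rancher/k3s/server/db/etcd ]; then
    echo "managed etcd already initialized; ignoring --server and --token"
fi
```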
Hi @brandond, I am using this setup (Exercise I Performed). PS: I am using k3OS on Node2 and Node3, and I am facing this issue: after a reboot the setup does not come up. Can you suggest anything in particular for this case?
@dharmendrakariya this issue was fixed in the PR linked above. Please confirm that you are running a release that includes the fix.
@brandond okay, great, but I think k3OS doesn't have this patch, because I hit this just today.
Ah, in that case you should probably open an issue in the k3os repo so that it can be tracked there.
Oops, my bad, will do that. You are right!
Environmental Info:
K3s Version: k3s version v1.19.1-rc1+k3s1 (041f18f)
Node(s) CPU architecture, OS, and Version:
Linux cluster01 5.4.0-1018-raspi #20-Ubuntu SMP Sun Sep 6 05:11:16 UTC 2020 aarch64 aarch64 aarch64 GNU/Linux
Cluster Configuration:
3 masters
Describe the bug:
After a fresh install, all nodes were up correctly. After a complete power off of all nodes, when the devices come back up, the nodes are not able to reconnect.
Steps To Reproduce:
Install K3s on the first master node.
On the 2 other masters, install K3s with the correct first-node URL and token (a hedged sketch of the typical commands follows).
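(The exact commands were not captured in the issue; this sketch assumes the standard get.k3s.io install script, and the IP and token are placeholders.)

```sh
# First master: initialize a new cluster with embedded etcd.
curl -sfL https://get.k3s.io | INSTALL_K3S_VERSION="v1.19.1-rc1+k3s1" sh -s - server --cluster-init

# Two other masters: join using the first node's URL and its node token
# (found at /var/lib/rancher/k3s/server/node-token on the first master).
curl -sfL https://get.k3s.io | INSTALL_K3S_VERSION="v1.19.1-rc1+k3s1" sh -s - server \
  --server https://<first-master-ip>:6443 \
  --token <node-token>
```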
Before the shutdown, all nodes are Ready
After the power off, here is the error on the first-created master node:
And on the other master nodes:
Expected behavior:
After a shutdown, all nodes should reconnect; at the very least, one of them should start correctly.
Actual behavior:
Connection errors to the nodes.
Additional context / logs:
On the first master, these messages repeat cyclically:
On the other masters, these cyclic messages appear: