
"k0s start" on worker node fails with "Error: service in failed state" #1638

Open
sebthom opened this issue Apr 6, 2022 · 16 comments

sebthom commented Apr 6, 2022

Before creating an issue, make sure you've checked the following:

  • You are running the latest released version of k0s
  • Make sure you've searched for existing issues, both open and closed
  • Make sure you've searched for PRs too, a fix might've been merged already
  • You're looking at docs for the released version; "main" branch docs are usually ahead of released versions.

Version

v1.23.5+k0s.0

Platform

$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 20.04.4 LTS
Release:        20.04
Codename:       focal

What happened?

I followed the multi node setup described at https://docs.k0sproject.io/v1.23.5+k0s.0/k0s-multi-node/

Setup of the controller node worked as described. However, when trying to start a worker node, the following error message appears without any further information:

$ sudo k0s start
Error: service in failed state
Usage:
  k0s start [flags]

Flags:
  -h, --help   help for start

Global Flags:
      --data-dir string   Data Directory for k0s (default: /var/lib/k0s). DO NOT CHANGE for an existing setup, things will break!
      --debug             Debug logging (default: false)

Steps to reproduce

  1. On controller node execute:
    sudo su - root
    curl -sSLf https://get.k0s.sh | sh
    k0s install controller
    k0s start
    sleep 5
    k0s status
    k0s token create --role=worker --expiry=1h > /tmp/worker-join-token
  2. Copy /tmp/worker-join-token from controller to worker node
  3. On worker node execute:
    sudo su - root
    curl -sSLf https://get.k0s.sh | sh 
    k0s install worker --token-file /tmp/worker-join-token
    k0s start

Expected behavior

The worker node joins the cluster managed by the controller, and the k0s service on the worker node starts without failures.

Actual behavior

Starting k0s on the worker node fails with Error: service in failed state

Screenshots and logs

No response

Additional context

$ k0s sysinfo
KERNEL_VERSION: 5.4.0-107-generic
CONFIG_INET: enabled
CONFIG_NETFILTER_XT_TARGET_REDIRECT: enabled (as module)
CONFIG_NETFILTER_XT_MATCH_COMMENT: enabled (as module)
CONFIG_NAMESPACES: enabled
CONFIG_UTS_NS: enabled
CONFIG_IPC_NS: enabled
CONFIG_PID_NS: enabled
CONFIG_NET_NS: enabled
CONFIG_CGROUPS: enabled
CONFIG_CGROUP_FREEZER: enabled
CONFIG_CGROUP_PIDS: enabled
CONFIG_CGROUP_DEVICE: enabled
CONFIG_CPUSETS: enabled
CONFIG_CGROUP_CPUACCT: enabled
CONFIG_MEMCG: enabled
CONFIG_CGROUP_SCHED: enabled
CONFIG_FAIR_GROUP_SCHED: enabled
CONFIG_EXT4_FS: enabled
CONFIG_PROC_FS: enabled
CONFIG_OVERLAY_FS: enabled (as module)
CONFIG_BLK_DEV_DM: enabled
CONFIG_CFS_BANDWIDTH: enabled
CONFIG_CGROUP_HUGETLB: enabled
CONFIG_SECCOMP: enabled
CONFIG_SECCOMP_FILTER: enabled
CONFIG_BRIDGE: enabled (as module)
CONFIG_IP6_NF_FILTER: enabled (as module)
CONFIG_IP6_NF_IPTABLES: enabled (as module)
CONFIG_IP6_NF_MANGLE: enabled (as module)
CONFIG_IP6_NF_NAT: enabled (as module)
CONFIG_IP_NF_FILTER: enabled (as module)
CONFIG_IP_NF_IPTABLES: enabled (as module)
CONFIG_IP_NF_MANGLE: enabled (as module)
CONFIG_IP_NF_NAT: enabled (as module)
CONFIG_IP_NF_TARGET_REJECT: enabled (as module)
CONFIG_IP_SET: enabled (as module)
CONFIG_IP_SET_HASH_IP: enabled (as module)
CONFIG_IP_SET_HASH_NET: enabled (as module)
CONFIG_IP_VS_NFCT: enabled
CONFIG_LLC: enabled (as module)
CONFIG_NETFILTER_NETLINK: enabled (as module)
CONFIG_NETFILTER_XTABLES: enabled (as module)
CONFIG_NETFILTER_XT_MARK: enabled (as module)
CONFIG_NETFILTER_XT_MATCH_ADDRTYPE: enabled (as module)
CONFIG_NETFILTER_XT_MATCH_CONNTRACK: enabled (as module)
CONFIG_NETFILTER_XT_MATCH_MULTIPORT: enabled (as module)
CONFIG_NETFILTER_XT_MATCH_RECENT: enabled (as module)
CONFIG_NETFILTER_XT_MATCH_STATISTIC: enabled (as module)
CONFIG_NETFILTER_XT_NAT: enabled (as module)
CONFIG_NETFILTER_XT_SET: enabled (as module)
CONFIG_NETFILTER_XT_TARGET_MASQUERADE: enabled (as module)
CONFIG_NF_CONNTRACK: enabled (as module)
CONFIG_NF_DEFRAG_IPV4: enabled (as module)
CONFIG_NF_DEFRAG_IPV6: enabled (as module)
CONFIG_NF_NAT: enabled (as module)
CONFIG_NF_REJECT_IPV4: enabled (as module)
CONFIG_STP: enabled (as module)
OS: Linux
CGROUPS_CPU: enabled
CGROUPS_CPUACCT: enabled
CGROUPS_CPUSET: enabled
CGROUPS_DEVICES: enabled
CGROUPS_FREEZER: enabled
CGROUPS_MEMORY: enabled
CGROUPS_PIDS: enabled
CGROUPS_HUGETLB: enabled
sebthom added the bug label on Apr 6, 2022

odidev commented Apr 7, 2022

Hi Team,

I am also facing this issue when executing "k0s start" on the worker node.
The "k0s stop" and "k0s delete" commands show the same error after executing "k0s install worker" on the worker node, so I am unable to stop or reset the k0s service.

Can you please provide some pointers?

twz123 (Member) commented Apr 8, 2022

The error message stems from the fact that the k0s systemd service is in a failed state. The k0s start command looks for the installed service and checks its status to verify that it's actually installed. That's where the "failed" error slips through. k0s should probably treat that case a bit differently in the start/stop/delete subcommands.

Can you try to start k0s manually via systemctl? Does it work then?
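
For example (a minimal sketch, assuming the worker was installed as the default k0sworker unit):

sudo systemctl status k0sworker       # show the unit's current state and last exit code
sudo systemctl start k0sworker        # try to start it manually
sudo journalctl -u k0sworker -n 100   # inspect the last log lines if it still fails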

sebthom (Author) commented Apr 8, 2022

Apparently, in my case an extra newline character had accidentally been added to the token file on the worker nodes, which prevented them from joining the controller. After I fixed this, it works now.
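
For anyone hitting the same thing, a quick way to spot and strip a trailing newline (a sketch, assuming the token file is /tmp/worker-join-token as in the steps above):

cat -A /tmp/worker-join-token
# cat -A marks each line end with $, so an extra trailing newline is easy to spot
printf '%s' "$(cat /tmp/worker-join-token)" > /tmp/worker-join-token.clean
# command substitution drops trailing newlines, leaving a clean copy of the token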

sebthom closed this as completed Apr 8, 2022
twz123 (Member) commented Apr 8, 2022

Thanks for the feedback. Even if the root cause in your case was a configuration error, I'll be reopening this, since there's definitely something to be improved here on the k0s side.

twz123 reopened this Apr 8, 2022
jnummelin (Collaborator) commented

What if we checked the service state in k0s start and, if we detect it failing, printed that info out to the user with some help context, like "Service failed to start, check the logs with journalctl ..."?

jnummelin added the enhancement, area/cli, and area/install labels and removed the bug label on Apr 11, 2022

odidev commented Apr 11, 2022

@twz123, thank you for reopening the ticket.

As @sebthom suggested, I checked my token file and there was no extra character added, yet I am still getting the same issue. FYI, I am using an AWS x64 Ubuntu instance as the controller node and another AWS x64 Ubuntu instance as the worker node.

Also, as mentioned above, I checked the systemctl unit list on the worker node and found k0sworker.service in a failed state. I restarted the service manually using systemctl, but that did not resolve the issue.

Also, sudo k0s kubectl get nodes shows "No resources found" on both nodes, controller and worker.

jnummelin (Collaborator) commented

k0sworker.service was found failing

As the service is failing, the logs probably contain some hints as to why it fails. Check with journalctl -u k0sworker.service ...
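
For example (the exact unit name depends on how the service was installed; k0sworker.service is the default for k0s install worker):

journalctl -u k0sworker.service --no-pager -n 200   # dump the last 200 lines
journalctl -u k0sworker.service -f                  # follow the log while retrying k0s start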


odidev commented Jul 27, 2022

Hi Team, I followed the manual installation for k0s again.

I successfully created the controller node on an AWS Ubuntu instance and created another worker node on another AWS instance, using the join token created in the controller node.

FYI: I needed to include the --enable-worker flag along with the k0s install controller command.

When I execute k0s kubectl get nodes on the controller node, the output below is received:

NAME                                         STATUS   ROLES           AGE     VERSION 
ip-172-31-5-166.us-east-2.compute.internal   Ready    control-plane   2m52s   v1.24.2+k0s 

This is OK as per the documents.

But my understanding is that the k0s kubectl get nodes command should also display the worker node, since the worker node was successfully created using the join token.

Am I correct in my understanding, or is the above output the expected behavior?

jnummelin (Collaborator) commented

May I know, am I correct in my understanding, or is the above output an expected behavior?

If you have another instance running as a k0s worker, it should appear in the node list. So something is off on the pure worker node.

I would advise checking the logs on the worker node with something like sudo journalctl -u k0sworker. That should hopefully shed some light on why the worker is not able to connect to the controller.

As this is AWS infra, are the security groups configured in a way that allows the two nodes to properly connect with each other? I'd start by enabling full allow within the SG that both nodes are in.


odidev commented Jul 27, 2022

I also tried installing k0s on local servers. The result is the same as on AWS.

The journalctl -u k0sworker command showed the following:

Jul 27 11:39:58 ip-172-31-19-8 k0s[8785]: time="2022-07-27 11:39:58" level=warning msg="failed to get initial kubelet config with join token: failed to get kubelet config from API: Get "https://172.31.46.24:6443/api/v1/namespaces/kube-system/configmaps/kubelet-config-default-1.24\": dial tcp 172.31.46.24:6443: i/o timeout"

This means the workers were able to get the controller's IP but could not connect to it, or could not read some config information.

I checked that my k0s.yml file already has the public IP of my controller, and now that I am working on local servers, security groups are no longer an issue.

Can you please guide me on what I can check next?

jnummelin (Collaborator) commented

So clearly the worker node cannot connect to the controller's IP. A few things I'd check:

  • Is the apiserver on the controller listening properly? I.e., can you access the API (e.g. via curl) locally on the controller:
    • curl -k https://localhost:6443
    • curl -k https://<controller IP>:6443; the controller IP here being the one you'd expect workers to connect to
  • Does the token contain the expected address? You can decode the token with cat k0s.token | base64 -d | gunzip
  • Can you actually connect from the worker to the controller? Test e.g. with netcat:
    • on the controller, run netcat -l 4444
    • on the worker, run curl <controller IP>:4444; you should see some HTTP headers being received on the controller


odidev commented Jul 29, 2022

Thank you for the suggestions.

I curled the controller from the worker and got the results below:

{ 
  "kind": "Status", 
  "apiVersion": "v1", 
  "metadata": {}, 
  "status": "Failure", 
  "message": "Unauthorized", 
  "reason": "Unauthorized", 
  "code": 401 
} 

Connection failed. Also, decoding the join token showed the correct IP address of the controller.

And with the netcat test, the connection again timed out on curl, since there was no response on the controller side.

I tried running an nginx service on the controller machine on port 80 and curled the controller on port 80 (curl http://IP:80) from the worker machine. That connection is successful. I'm not sure why it's failing with k0s.

I am reading this document and found that we need to configure the firewall to allow outbound access on ports 6443 and 8132. I did that as follows:

iptables -A OUTPUT -p tcp -d <controller’s IP> --dport 6443 -j ACCEPT 
iptables -A OUTPUT -p tcp -d <controller’s IP> --dport 8132 -j ACCEPT 

But that had no effect.

jnummelin (Collaborator) commented

And with the netcat command, again connection timed out while curl, since there was no response on the controller side.

So, from the worker, you can curl <controller IP>:6443, but it fails with netcat? It sounds truly bizarre that curl works but nothing else does. Is it possible that there's something like SELinux preventing the connections?
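
A couple of quick checks on the controller might help narrow it down (a sketch; 6443 is the Kubernetes API port and 8132 the konnectivity port, and getenforce is only available where SELinux tooling is installed):

getenforce 2>/dev/null || echo "SELinux tooling not installed"   # is SELinux enforcing?
sudo ss -tlnp | grep -E ':(6443|8132)'                           # is anything actually listening on those ports?
sudo iptables -S INPUT                                           # any DROP/REJECT rules on the inbound side?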

till (Contributor) commented Nov 8, 2023

I just ran into a similar problem:

I rebooted all my worker nodes at the same time (to see what would happen in case of some kind of failure). Each worker is now stuck:

core@node-003 ~ $ sudo journalctl -u k0sworker --follow
Nov 08 14:01:08 node-003.prod systemd[1]: Started k0sworker.service - k0s - Zero Friction Kubernetes.
Nov 08 14:01:08 node-003.prod k0s[1583]: Error: failed to decode join token: illegal base64 data at input byte 0
Nov 08 14:01:08 node-003.prod systemd[1]: k0sworker.service: Main process exited, code=exited, status=1/FAILURE
Nov 08 14:01:08 node-003.prod systemd[1]: k0sworker.service: Failed with result 'exit-code'.
Nov 08 14:03:08 node-003.prod systemd[1]: k0sworker.service: Scheduled restart job, restart counter is at 3.
Nov 08 14:03:08 node-003.prod systemd[1]: Stopped k0sworker.service - k0s - Zero Friction Kubernetes.
Nov 08 14:03:08 node-003.prod systemd[1]: Started k0sworker.service - k0s - Zero Friction Kubernetes.
Nov 08 14:03:08 node-003.prod k0s[1701]: Error: failed to decode join token: illegal base64 data at input byte 0
Nov 08 14:03:08 node-003.prod systemd[1]: k0sworker.service: Main process exited, code=exited, status=1/FAILURE
Nov 08 14:03:08 node-003.prod systemd[1]: k0sworker.service: Failed with result 'exit-code'.

I also tried running k0sctl apply again, but that complains about the unit being present but not started.

till (Contributor) commented Nov 8, 2023

The join token is empty:

core@node-003 ~ $ sudo cat /etc/k0s/k0stoken 
# overwritten by k0sctl after join

But I am also not sure why it would need the join token after a (simple) reboot?

till (Contributor) commented Nov 8, 2023

I managed to recover with k0sctl apply --force. I'm not entirely sure why that was necessary, though.
