
unable to connect agent to master #1523

Closed
cvernooy23 opened this issue Mar 12, 2020 · 23 comments
cvernooy23 commented Mar 12, 2020

Version:
k3s version v1.17.3+k3s1 (5b17a17)

Describe the bug
unable to join workers to the cluster

To Reproduce
install k3s w/ default options on nodeA
install k3s agent on nodeB using
sudo /usr/local/bin/k3s agent -s https://{my_server_ip}:6443 -t <token from "/var/lib/rancher/k3s/server/node-token" on master node> --with-node-id 1

Expected behavior
node B joins the cluster

Actual behavior
the node will not join; it cannot access the local proxy to the master API

Additional context
No firewalls are on the system or between the two nodes (virtualized nodes, same subnet, directly exposed to LAN)

INFO[2020-03-12T17:11:30.931254304Z] Starting k3s agent v1.17.3+k3s1 (5b17a175)
INFO[2020-03-12T17:11:30.935231883Z] module overlay was already loaded
INFO[2020-03-12T17:11:30.935291956Z] module nf_conntrack was already loaded
INFO[2020-03-12T17:11:30.935301374Z] module br_netfilter was already loaded
INFO[2020-03-12T17:11:30.935629307Z] Running load balancer 127.0.0.1:36635 -> [192.168.33.10:6443]
INFO[2020-03-12T17:11:31.011412790Z] Logging containerd to /var/lib/rancher/k3s/agent/containerd/containerd.log
INFO[2020-03-12T17:11:31.011560067Z] Running containerd -c /var/lib/rancher/k3s/agent/etc/containerd/config.toml -a /run/k3s/containerd/containerd.sock --state /run/k3s/containerd --root /var/lib/rancher/k3s/agent/containerd
INFO[2020-03-12T17:11:31.012088165Z] Waiting for containerd startup: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial unix /run/k3s/containerd/containerd.sock: connect: no such file or directory"
INFO[2020-03-12T17:11:32.125317412Z] Updating load balancer server addresses -> [10.0.2.15:6443 192.168.33.10:6443]
INFO[2020-03-12T17:11:32.125686191Z] Connecting to proxy                           url="wss://10.0.2.15:6443/v1-k3s/connect"
ERRO[2020-03-12T17:11:32.125917574Z] Failed to connect to proxy                    error="dial tcp 10.0.2.15:6443: connect: connection refused"
ERRO[2020-03-12T17:11:32.125936712Z] Remotedialer proxy error                      error="dial tcp 10.0.2.15:6443: connect: connection refused"
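A quick way to triage this kind of failure is to probe each address the agent's load balancer learned; only the reachable one should answer. A diagnostic sketch using the addresses from the log above (the `/ping` path is assumed from the supervisor's health-check endpoint; a plain TCP check with `nc -zv` works just as well):

```shell
# Probe both supervisor addresses the agent's load balancer learned.
# 192.168.33.10 is the address the agent was pointed at; 10.0.2.15 is the
# VirtualBox NAT address the server advertised, which is expected to
# refuse the connection, matching the "connection refused" error above.
curl -vk --max-time 5 https://192.168.33.10:6443/ping
curl -vk --max-time 5 https://10.0.2.15:6443/ping
```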
juusujanar commented Mar 16, 2020

+1, ran into the same issue.

We have a k3s server running on a GCE instance. I attempted to connect a remote node to the server using its public IP (35.228.x.x). k3s-agent created a load balancer from 127.0.0.1:41933 to 35.228.x.x:6443, but when trying to connect to the proxy, it uses the instance's private IP (10.50.x.x), to which the remote node has no connectivity.

Mar 16 19:57:05 rpi01 k3s[6531]: time="2020-03-16T19:57:05.931596718Z" level=info msg="Starting k3s agent v1.17.3+k3s1 (5b17a175)"
Mar 16 19:57:05 rpi01 k3s[6531]: time="2020-03-16T19:57:05.932180024Z" level=info msg="module overlay was already loaded"
Mar 16 19:57:05 rpi01 k3s[6531]: time="2020-03-16T19:57:05.932245023Z" level=info msg="module nf_conntrack was already loaded"
Mar 16 19:57:05 rpi01 k3s[6531]: time="2020-03-16T19:57:05.932286856Z" level=info msg="module br_netfilter was already loaded"
Mar 16 19:57:05 rpi01 k3s[6531]: time="2020-03-16T19:57:05.932996271Z" level=info msg="Running load balancer 127.0.0.1:41933 -> [35.228.x.x:6443]"
Mar 16 19:57:07 rpi01 k3s[6531]: time="2020-03-16T19:57:07.518778640Z" level=info msg="Logging containerd to /var/lib/rancher/k3s/agent/containerd/containerd.log"
Mar 16 19:57:07 rpi01 k3s[6531]: time="2020-03-16T19:57:07.521796075Z" level=info msg="Running containerd -c /var/lib/rancher/k3s/agent/etc/containerd/config.toml -a /run/k3s/containerd/containerd.sock --state /run/k3s/containerd --root /var/lib/rancher/k3s/agent/containerd"
Mar 16 19:57:07 rpi01 k3s[6531]: time="2020-03-16T19:57:07.528059443Z" level=info msg="Waiting for containerd startup: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = \"transport: Error while dialing dial unix /run/k3s/containerd/containerd.sock: connect: connection refused\""
Mar 16 19:57:08 rpi01 k3s[6531]: time="2020-03-16T19:57:08.644823380Z" level=info msg="Updating load balancer server addresses -> [10.50.0.6:6443 35.228.x.x:6443]"
Mar 16 19:57:08 rpi01 k3s[6531]: time="2020-03-16T19:57:08.645624812Z" level=info msg="Connecting to proxy" url="wss://10.50.0.6:6443/v1-k3s/connect"

windywork commented Mar 21, 2020

I ran into the same issue. This is how I solved it:

Create the cluster master node:

export K3S_NODE_NAME=${HOSTNAME//_/-}
export K3S_EXTERNAL_IP=xx.xx.xx.xx
export INSTALL_K3S_EXEC="--docker --write-kubeconfig ~/.kube/config --write-kubeconfig-mode 666 --tls-san $K3S_EXTERNAL_IP --kube-apiserver-arg service-node-port-range=1-65000 --kube-apiserver-arg advertise-address=$K3S_EXTERNAL_IP --kube-apiserver-arg external-hostname=$K3S_EXTERNAL_IP"
curl -sfL https://docs.rancher.cn/k3s/k3s-install.sh |  sh -

Get the token on the master node:

echo -e "export K3S_TOKEN=$(cat /var/lib/rancher/k3s/server/node-token)\nexport K3S_URL=https://$K3S_EXTERNAL_IP:6443\nexport INSTALL_K3S_EXEC=\"--docker --token \$K3S_TOKEN --server \$K3S_URL\""

Join workers to the cluster:

export K3S_TOKEN=xxxx
export K3S_URL=https://xx.xx.xx.xx:6443
export INSTALL_K3S_EXEC="--docker --token $K3S_TOKEN --server $K3S_URL"
export K3S_NODE_NAME=${HOSTNAME//_/-}
curl -sfL https://docs.rancher.cn/k3s/k3s-install.sh | sh -
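As an aside, the `${HOSTNAME//_/-}` expansion used above replaces every underscore with a hyphen, since Kubernetes node names must be valid RFC 1123 DNS labels, which forbid underscores. A minimal bash illustration (the hostname here is a made-up example):

```shell
#!/usr/bin/env bash
# Kubernetes node names must be RFC 1123 DNS labels: no underscores allowed.
# Bash's ${var//pattern/replacement} replaces all occurrences of pattern.
HOSTNAME="my_worker_01"
K3S_NODE_NAME=${HOSTNAME//_/-}
echo "$K3S_NODE_NAME"   # prints my-worker-01
```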

@juusujanar
Yup, configuring the master node's external IP worked for me.

md2119 commented Jun 2, 2020

I am facing the same issue.
On master node:

curl -sfL https://get.k3s.io | INSTALL_K3S_EXEC="--write-kubeconfig ~/.kube/config --write-kubeconfig-mode 666 --tls-san <<public-ip>> --kube-apiserver-arg advertise-address=<<public-ip>>" sh -

On agent node:
contents of /etc/systemd/system/k3s-agent.service.env

K3S_TOKEN=<<node-token>>
K3S_URL=https://<<master-public-ip>>:6443
curl -sfL https://get.k3s.io | INSTALL_K3S_EXEC="agent" sh -

brandond (Contributor) commented Jun 2, 2020

@md2119 don't use --kube-apiserver-arg advertise-address=<<public-ip>>.

Use --node-external-ip=<<public-ip>> instead.
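A sketch of what that looks like at install time (the `<<public-ip>>` placeholder mirrors the notation used earlier in this thread; substitute your own address):

```shell
# On the server: register the public IP as the node's external address
# instead of overriding the apiserver's advertise-address.
curl -sfL https://get.k3s.io | INSTALL_K3S_EXEC="server \
  --node-external-ip <<public-ip>> \
  --tls-san <<public-ip>>" sh -
```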

carlosonunez pushed a commit to carlosonunez/hello-azure that referenced this issue Jul 2, 2020
Welcome to k3s! Instructions might be out of date.

The instructions provided work great if you're installing k3s locally or
on a cloud instance. They fail hard when you're installing it on VBox
with Vagrant because the CNI, `flannel`, assumes that all network comms
will go out from the default interface, which is the NAT bridge on VBox
VMs. The end-result is that no nodes ever join the cluster and you have
no idea why unless you run `journalctl` to peek into what's going on.

Kubernetes continues to be a powerful pain in the ass.

Sources:

- https://github.com/michaelc0n/k3s/blob/master/Vagrantfile
- k3s-io/k3s#1523

tcurdt commented Feb 3, 2021

Same issue. It just picks the wrong interface.

Not very encouraging that this issue is still open almost a year later :-/

brandond (Contributor) commented Feb 4, 2021

@tcurdt how would you suggest we resolve this issue? It is a support request, not a defect in the software.

Nodes need to be properly configured to support the environment they are deployed in. Absent any cloud-provider-specific integrations, this includes telling them what their public IP address is, if it differs from the IP assigned to the interface. I provided an example of how to do this in the post right above yours.

Are you just here to me-too, or would you care to provide any information on how your environment is configured, what errors you've encountered, and what you've tried so far?

tcurdt commented Feb 4, 2021

IMO it is a defect in the install script. It should detect that there is more than one option, report that, and not just use the first interface. On top of that, I consider this also a "defect" in the documentation.

With that said, I am happy to provide more details and to help resolve this. In fact, it is super easy to reproduce, as this is a Vagrant+VirtualBox setup. Here is a snippet of the most relevant part:

# ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
    link/ether 08:00:27:8d:c0:4d brd ff:ff:ff:ff:ff:ff
    inet 10.0.2.15/24 brd 10.0.2.255 scope global dynamic eth0
       valid_lft 74283sec preferred_lft 74283sec
    inet6 fe80::a00:27ff:fe8d:c04d/64 scope link 
       valid_lft forever preferred_lft forever
3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
    link/ether 08:00:27:e4:99:30 brd ff:ff:ff:ff:ff:ff
    inet 192.168.100.101/24 brd 192.168.100.255 scope global eth1
       valid_lft forever preferred_lft forever
    inet6 fe80::a00:27ff:fee4:9930/64 scope link 
       valid_lft forever preferred_lft forever
4: flannel.1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UNKNOWN group default 
    link/ether 76:57:da:d0:00:10 brd ff:ff:ff:ff:ff:ff
    inet 10.42.0.0/32 scope global flannel.1
       valid_lft forever preferred_lft forever
    inet6 fe80::7457:daff:fed0:10/64 scope link 
       valid_lft forever preferred_lft forever

As for the above example - it sure helped with the joining.

I don't see any node metrics in Lens yet - but I need to check if this could be due to a different reason.

brandond (Contributor) commented Feb 4, 2021

The install script doesn't detect much of anything with regards to network configuration; it's already more complicated than we would like.

Kubernetes and Flannel attempt to guess the best address by looking at what interface is associated with the default route. There are flags to override interface selection, as well as override both the internal and external IPs if you need to do so. You can provide those flags to the install script and it will pass them through to the systemd unit.

In short, there is always some point of complexity after which you are going to have to tell the system what you want. We're glad to help you navigate that, if you show up looking for help instead of just to complain.
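For reference, a sketch of passing those overrides through the install script. The flag names come from the k3s server CLI; the addresses and interface name below are placeholders for a Vagrant-style two-NIC box like the one shown above:

```shell
# Pin the node's IPs and the flannel interface explicitly instead of
# relying on default-route autodetection (placeholder values: adjust
# to your environment).
curl -sfL https://get.k3s.io | INSTALL_K3S_EXEC="server \
  --node-ip 192.168.100.101 \
  --node-external-ip 192.168.100.101 \
  --flannel-iface eth1" sh -
```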

tcurdt commented Feb 4, 2021

The install script doesn't detect much of anything with regards to network configuration; it's already more complicated than we would like.

Not sure that check is the straw that breaks the camel's back, but then it's even more important to point this out in the documentation.
If you project a "look, just curl | sh and you are done!" image, you are creating expectations.

Kubernetes and Flannel attempt to guess the best address by looking at what interface is associated with the default route. There are flags to override interface selection, as well as override both the internal and external IPs if you need to do so. You can provide those flags to the install script and it will pass them through to the systemd unit.

It would be great to mention these details here.

Would you agree?

We're glad to help you navigate that, if you show up looking for help instead of just to complain.

I am also happy to help. I can even offer a Vagrantfile to reproduce this. Not sure that counts as "complaining".

smnblt commented Feb 5, 2021

Hello, I ran into a similar issue.
My k3s master runs in a VMware VM. I have two interfaces on my machine, Wi-Fi and Ethernet. The Wi-Fi provides Internet connectivity; the Ethernet is connected to a local switch. The virtual machine connects to the Wi-Fi via NAT (ens33, which gets the address 192.168.80.129/24), while the Ethernet is bridged (ens38, with static address 192.168.100.1/24).

Note, ens33 is the first interface to show up. Further, I have a default route via ens33.

I have a Raspberry Pi worker node connected to the local switch via Ethernet (eth0, with static address 192.168.100.2/24). It also has Wi-Fi access to the Internet.
I want the connection between the k3s master and the worker to happen over the local switch.

In a naive attempt, I set on the k3s server daemon the fields --bind-address 192.168.100.1, --advertise-address 192.168.100.1, and --flannel-iface ens38.

On worker node I set K3S_URL=https://192.168.100.1:6443

On the worker node, looking at systemctl status k3s-agent, I see
level=info msg="Connecting to proxy" url="wss://192.168.80.129:6443/v1-k3s/connect"
which fails because there is no route to 192.168.80.129.

Then, on K3s server I set --node-ip 192.168.100.1 and --node-external-ip 192.168.100.1.

However, on the worker node, systemctl status k3s-agent still shows
level=info msg="Connecting to proxy" url="wss://192.168.80.129:6443/v1-k3s/connect"

My guess is that the k3s master is advertising/binding to the wrong address!
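One way to see which addresses the server actually registered is to query the API from the master (a diagnostic sketch, assuming working kubectl access there):

```shell
# Wide output includes the INTERNAL-IP and EXTERNAL-IP columns.
kubectl get nodes -o wide

# Or dump the full address list per node (InternalIP/ExternalIP/Hostname).
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{": "}{.status.addresses}{"\n"}{end}'
```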

smnblt commented Feb 5, 2021

Little update.
I did the following test on the VM where the k3s master is installed.
I turned down the ens33 interface with address 192.168.80.129,
then ran systemctl restart k3s.

I got the following in journalctl -u k3s:

Feb 05 13:08:17 masterpi k3s[4132]: I0205 13:08:17.245736    4132 node.go:172] Successfully retrieved node IP: 192.168.80.129
Feb 05 13:08:17 masterpi k3s[4132]: I0205 13:08:17.245796    4132 server_others.go:143] kube-proxy node IP is an IPv4 address (192.168.80.129), assume IPv4 operation

After a while

Feb 05 13:08:37 masterpi k3s[4132]: time="2021-02-05T13:08:37.167535067+01:00" level=info msg="Connecting to proxy" url="wss://192.168.80.129:6443/v1-k3s/connect"
Feb 05 13:08:37 masterpi k3s[4132]: E0205 13:08:37.258872    4132 available_controller.go:508] v1beta1.metrics.k8s.io failed with: failing or missing response from https://10.43.87.158:443/apis/metrics.k8s.io/v1beta1: Get "https://10.43.87.158:443/apis/metrics.k8s.io/v1beta1": dial tcp 10.43.87.158:443: connect: no route to host
Feb 05 13:08:38 masterpi k3s[4132]: W0205 13:08:38.506130    4132 lease.go:233] Resetting endpoints for master service "kubernetes" to [192.168.100.1]
Feb 05 13:08:38 masterpi k3s[4132]: time="2021-02-05T13:08:38.510657974+01:00" level=info msg="Stopped tunnel to 192.168.80.129:6443"
Feb 05 13:08:40 masterpi k3s[4132]: time="2021-02-05T13:08:40.231111049+01:00" level=error msg="Failed to connect to proxy" error="dial tcp 192.168.80.129:6443: connect: no route to host"

The ip route show output is:

default via 192.168.100.1 dev ens38 
10.42.0.0/24 dev cni0 proto kernel scope link src 10.42.0.1 
10.42.1.0/24 via 10.42.1.0 dev flannel.1 onlink 
169.254.0.0/16 dev cni0 scope link metric 1000 
192.168.100.0/24 dev ens38 proto kernel scope link src 192.168.100.1 metric 101 

Where is k3s picking up the 192.168.80.129 address from?

@ElefHead
Anyone who got Raspberry Pis, jumped straight into installing k3s, and is having this issue: please learn from me, a long-time Linux user who didn't run

sudo apt update
sudo apt upgrade

before proceeding.
Save yourselves a day of googling.

falterfriday commented May 9, 2021

If you're running a Traefik reverse proxy as your external load balancer in an HA config, this is what did the trick for me:

Add the load balancer server port label:
traefik.http.services.k3s.loadbalancer.server.port=6443

Add an entrypoint:

entryPoints:
  http:
    address: ":80"
  https:
    address: ":443"
  k3s:
    address: ":6443"

Add a TCP router and service:

tcp:
  services:
    k3s:
      loadBalancer:
        servers:
        - address: "10.240.0.11:6443"
        - address: "10.240.0.12:6443"
  routers:
    k3s:
      entryPoints:
        - "k3s"
      rule: "HostSNI(`*`)"
      tls:
        passthrough: true
      service: k3s

lkj4 commented Jun 15, 2021

I still can't get agents to connect to servers. Servers happily connect to each other, but agents don't. I tried it with --node-ip and --node-external-ip, with no success. Below is my script; the important part for this thread is just the function and, within it, the call to the install script with its args and flags:

#!/bin/bash
set -x
# Following line creates an array of all public ips of the created VPS
nodes=(`terraform output -raw ips`)

install_k3s () {
  if [ $run -eq 0 ]; then
    append=" --cluster-init"
  else
    append=" --server https://${nodes[0]}:6443"
  fi

  [ $run -gt 2 ] && role="agent"

  ssh-keyscan -H $1 >> ~/.ssh/known_hosts
  sleep .5

  ssh root@$1 "swapoff -a"
  ssh root@$1 "curl -sfL https://get.k3s.io | \
    INSTALL_K3S_CHANNEL=latest \
    INSTALL_K3S_VERSION=v1.21.0+k3s1 \
    sh -s - \
    $role \
    --node-external-ip $1 \
    --token someInsecureToken \
    --disable traefik \
    --disable servicelb \
    $append"

  ((++run))
}

run=0
role="server"
for e in ${nodes[*]}; do
  install_k3s $e
done
set +x

Edit: To clarify, when I run kubectl top nodes or kubectl get nodes I only get the control plane nodes listed.

@brandond (Contributor)
@lkj4 what do the logs on your agents say?

lkj4 commented Jun 16, 2021

@brandond this tip was good; I think agents don't have --disable, see below. I'll adjust the script and report back... Edit: It works now with the --disable flags removed. Thanks again for this great support!!

ssh root@<public IP of this host> 'curl -sfL https://get.k3s.io |     INSTALL_K3S_CHANNEL=latest     INSTALL_K3S_VERSION=v1.21.0+k3s1     sh -s -     agent     --node-external-ip <public IP of this host>    --token <token>     --disable traefik     --disable servicelb      --server https://<public ip of first control plane>:6443'
...
[INFO]  systemd: Starting k3s-agent

OK, I got the following a gazillion times...

cat /var/log/syslog
...
Jun 16 12:07:23 guest k3s[2508]: time="2021-06-16T12:07:23Z" level=fatal msg="flag provided but not defined: -disable"
Jun 16 12:07:23 guest systemd[1]: k3s-agent.service: Main process exited, code=exited, status=1/FAILURE
Jun 16 12:07:23 guest systemd[1]: k3s-agent.service: Failed with result 'exit-code'.
Jun 16 12:07:28 guest systemd[1]: k3s-agent.service: Scheduled restart job, restart counter is at 34.
Jun 16 12:07:28 guest systemd[1]: Stopped Lightweight Kubernetes.
Jun 16 12:07:28 guest systemd[1]: Starting Lightweight Kubernetes...
Jun 16 12:07:28 guest systemd[1]: Started Lightweight Kubernetes.
Jun 16 12:07:28 guest k3s[2533]: Incorrect Usage: flag provided but not defined: -disable
Jun 16 12:07:28 guest k3s[2533]: NAME:
Jun 16 12:07:28 guest k3s[2533]:    k3s agent - Run node agent

@woniupapa
Dear All,
I created master, node1, and node2 (three peers). I have a problem with the agent node:

  1. On master, kubectl get nodes
     (screenshot)
  2. On node2, kubectl get nodes
     (screenshot)

Why can't node2 connect to the master? Help!!!

brandond (Contributor) commented Jun 22, 2021

@woniupapa can you provide any relevant information, such as logs from the agent?

Note that you cannot run kubectl from the agent, as it does not host the Kubernetes control-plane and does not have a copy of the k3s admin kubeconfig.

davesargrad commented Jul 22, 2021

I also have this issue (see here). Is there a clean fix? Or clean documentation that describes the fix?

I've tried to set K3S_EXTERNAL_IP... this is still not working.

See here for my solution.

@Student-Jasons
The above comment fixed the problem! I will close the matter.

nbbn commented Oct 28, 2021

I've got the same problem with Ubuntu 21.10 Server on an RPi.
I simply followed the Quick-Start Guide.
It's disappointing that this problem still exists after so much time since the first report.

@brandond (Contributor)
@nbbn you're probably best off opening a new issue describing what specifically you're having problems with; this one has become a bit of a dumping ground for folks with unrelated and/or poorly described issues.

@k3s-io k3s-io locked as off-topic and limited conversation to collaborators Oct 28, 2021