Order of INSTALL_K3S_EXEC can break cluster join #7141

Closed
itz-Jana opened this issue Mar 23, 2023 · 5 comments

Comments


itz-Jana commented Mar 23, 2023

Environmental Info:
K3s Version:

v1.25.7+k3s1

Node(s) CPU architecture, OS, and Version:

Linux ananke 5.10.0-21-amd64 #1 SMP Debian 5.10.162-1 (2023-01-21) x86_64 GNU/Linux (same for both nodes)

Cluster Configuration:

2 servers (10.0.0.13 and 10.0.0.14)

Describe the bug:

Trying to install K3s with custom arguments, disabling Flannel and changing the CIDRs.
Joining the second cluster node fails, depending on the order of the arguments in INSTALL_K3S_EXEC.

Steps To Reproduce:

Base System: Clean installed Debian 11

Scenario 1:

Node 1:

export INSTALL_K3S_VERSION=v1.25.7+k3s1
export INSTALL_K3S_EXEC="--cluster-cidr=10.42.0.0/16,2001:cafe:42:0::/56 --service-cidr=10.43.0.0/16,2001:cafe:42:1::/112 --disable traefik --disable servicelb --flannel-backend=none --disable-network-policy"
curl -sfL https://get.k3s.io | sh -s - server --cluster-init
cat /var/lib/rancher/k3s/server/node-token

Node 2:

export INSTALL_K3S_VERSION=v1.25.7+k3s1
export INSTALL_K3S_EXEC="--cluster-cidr=10.42.0.0/16,2001:cafe:42:0::/56 --service-cidr=10.43.0.0/16,2001:cafe:42:1::/112 --disable traefik --disable servicelb --flannel-backend=none --disable-network-policy"
export K3S_TOKEN="..."   # copied from Node 1
curl -sfL https://get.k3s.io | sh -s - server --server https://10.0.0.13:6443

Scenario 2:

Exact same steps as Scenario 1, except with these EXEC args on BOTH nodes (same args, just a different order):

export INSTALL_K3S_EXEC="--flannel-backend=none --disable-network-policy --cluster-cidr=10.42.0.0/16,2001:cafe:42:0::/56 --service-cidr=10.43.0.0/16,2001:cafe:42:1::/112 --disable traefik --disable servicelb"

Expected behavior:

Scenario 1 and Scenario 2 both create a working k3s cluster with 2 nodes.

Actual behavior:

Scenario 2 creates a working k3s cluster with 2 nodes.
In Scenario 1, the second node never appears in "kubectl get nodes" on Node 1.
Running the command on Node 2 likewise shows only the second node, so it seems as if two separate clusters were created.

Additional context / logs:

Logs for service k3s taken from journalctl:

Scenario 1:
node1logs_scenario1.txt
node2logs_scenario1.txt

Scenario 2:
node1logs_scenario2.txt
node2logs_scenario2.txt

@brandond
Member

Can you show the resulting systemd units for the different scenarios? The same flag set should be written regardless of the order; I suspect something else is going on, perhaps with quoting.

@itz-Jana
Author

Doesn't seem like it; the diff between the two scenarios on both nodes looks like this:

30a31,32
>         '--flannel-backend=none' \
>         '--disable-network-policy' \
37,38d38
<         '--flannel-backend=none' \
<         '--disable-network-policy' \

So it's just the arguments changing position.

Here are the full units:

Scenario 1:

Node 1:

[Unit]
Description=Lightweight Kubernetes
Documentation=https://k3s.io
Wants=network-online.target
After=network-online.target

[Install]
WantedBy=multi-user.target

[Service]
Type=notify
EnvironmentFile=-/etc/default/%N
EnvironmentFile=-/etc/sysconfig/%N
EnvironmentFile=-/etc/systemd/system/k3s.service.env
KillMode=process
Delegate=yes
# Having non-zero Limit*s causes performance problems due to accounting overhead
# in the kernel. We recommend using cgroups to do container-local accounting.
LimitNOFILE=1048576
LimitNPROC=infinity
LimitCORE=infinity
TasksMax=infinity
TimeoutStartSec=0
Restart=always
RestartSec=5s
ExecStartPre=/bin/sh -xc '! /usr/bin/systemctl is-enabled --quiet nm-cloud-setup.service'
ExecStartPre=-/sbin/modprobe br_netfilter
ExecStartPre=-/sbin/modprobe overlay
ExecStart=/usr/local/bin/k3s \
    server \
        '--cluster-cidr=10.42.0.0/16,2001:cafe:42:0::/56' \
        '--service-cidr=10.43.0.0/16,2001:cafe:42:1::/112' \
        '--disable' \
        'traefik' \
        '--disable' \
        'servicelb' \
        '--flannel-backend=none' \
        '--disable-network-policy' \
        'server' \
        '--cluster-init' \

Node 2:

[Unit]
Description=Lightweight Kubernetes
Documentation=https://k3s.io
Wants=network-online.target
After=network-online.target

[Install]
WantedBy=multi-user.target

[Service]
Type=notify
EnvironmentFile=-/etc/default/%N
EnvironmentFile=-/etc/sysconfig/%N
EnvironmentFile=-/etc/systemd/system/k3s.service.env
KillMode=process
Delegate=yes
# Having non-zero Limit*s causes performance problems due to accounting overhead
# in the kernel. We recommend using cgroups to do container-local accounting.
LimitNOFILE=1048576
LimitNPROC=infinity
LimitCORE=infinity
TasksMax=infinity
TimeoutStartSec=0
Restart=always
RestartSec=5s
ExecStartPre=/bin/sh -xc '! /usr/bin/systemctl is-enabled --quiet nm-cloud-setup.service'
ExecStartPre=-/sbin/modprobe br_netfilter
ExecStartPre=-/sbin/modprobe overlay
ExecStart=/usr/local/bin/k3s \
    server \
        '--cluster-cidr=10.42.0.0/16,2001:cafe:42:0::/56' \
        '--service-cidr=10.43.0.0/16,2001:cafe:42:1::/112' \
        '--disable' \
        'traefik' \
        '--disable' \
        'servicelb' \
        '--flannel-backend=none' \
        '--disable-network-policy' \
        'server' \
        '--server' \
        'https://10.0.0.2:6443' \

Scenario 2:

Node 1:

[Unit]
Description=Lightweight Kubernetes
Documentation=https://k3s.io
Wants=network-online.target
After=network-online.target

[Install]
WantedBy=multi-user.target

[Service]
Type=notify
EnvironmentFile=-/etc/default/%N
EnvironmentFile=-/etc/sysconfig/%N
EnvironmentFile=-/etc/systemd/system/k3s.service.env
KillMode=process
Delegate=yes
# Having non-zero Limit*s causes performance problems due to accounting overhead
# in the kernel. We recommend using cgroups to do container-local accounting.
LimitNOFILE=1048576
LimitNPROC=infinity
LimitCORE=infinity
TasksMax=infinity
TimeoutStartSec=0
Restart=always
RestartSec=5s
ExecStartPre=/bin/sh -xc '! /usr/bin/systemctl is-enabled --quiet nm-cloud-setup.service'
ExecStartPre=-/sbin/modprobe br_netfilter
ExecStartPre=-/sbin/modprobe overlay
ExecStart=/usr/local/bin/k3s \
    server \
        '--flannel-backend=none' \
        '--disable-network-policy' \
        '--cluster-cidr=10.42.0.0/16,2001:cafe:42:0::/56' \
        '--service-cidr=10.43.0.0/16,2001:cafe:42:1::/112' \
        '--disable' \
        'traefik' \
        '--disable' \
        'servicelb' \
        'server' \
        '--cluster-init' \

Node 2:

[Unit]
Description=Lightweight Kubernetes
Documentation=https://k3s.io
Wants=network-online.target
After=network-online.target

[Install]
WantedBy=multi-user.target

[Service]
Type=notify
EnvironmentFile=-/etc/default/%N
EnvironmentFile=-/etc/sysconfig/%N
EnvironmentFile=-/etc/systemd/system/k3s.service.env
KillMode=process
Delegate=yes
# Having non-zero Limit*s causes performance problems due to accounting overhead
# in the kernel. We recommend using cgroups to do container-local accounting.
LimitNOFILE=1048576
LimitNPROC=infinity
LimitCORE=infinity
TasksMax=infinity
TimeoutStartSec=0
Restart=always
RestartSec=5s
ExecStartPre=/bin/sh -xc '! /usr/bin/systemctl is-enabled --quiet nm-cloud-setup.service'
ExecStartPre=-/sbin/modprobe br_netfilter
ExecStartPre=-/sbin/modprobe overlay
ExecStart=/usr/local/bin/k3s \
    server \
        '--flannel-backend=none' \
        '--disable-network-policy' \
        '--cluster-cidr=10.42.0.0/16,2001:cafe:42:0::/56' \
        '--service-cidr=10.43.0.0/16,2001:cafe:42:1::/112' \
        '--disable' \
        'traefik' \
        '--disable' \
        'servicelb' \
        'server' \
        '--server' \
        'https://10.0.0.2:6443' \

@brandond
Member

Those are all invalid constructions, as the server arg is included twice. It's just that when you do it the first way it's probably being parsed and ignored as the value of the --disable-network-policy flag.

I'm not sure why it's in there twice though...

@brandond
Member

brandond commented Mar 24, 2023

You can see what's going on here if you do this:

systemd-node-1:/ # curl -sL get.k3s.io | INSTALL_K3S_SKIP_START=true INSTALL_K3S_EXEC="--exec-flag-one foo --exec-flag-two" sh -s - --shell-flag-one bar --shell-flag-two
[...]

systemd-node-1:/ # grep -A10 ExecStart= /etc/systemd/system/k3s.service
ExecStart=/usr/local/bin/k3s \
    server \
	'--exec-flag-one' \
	'foo' \
	'--exec-flag-two' \
	'--shell-flag-one' \
	'bar' \
	'--shell-flag-two' \

The shell args go after the INSTALL_K3S_EXEC args. If you are going to use both to pass through args (which I wouldn't recommend; pick one) you shouldn't include the server command in the shell args, as those come second and will confuse things.
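To make the ordering concrete, here is a minimal illustrative sketch (not the actual install.sh code, using made-up variable names) of how the final argument list ends up assembled: the INSTALL_K3S_EXEC arguments come first, and the positional arguments passed after `sh -s -` are appended behind them.

```shell
# Illustrative sketch only, not the real installer: the unit's ExecStart is
# built from the INSTALL_K3S_EXEC args first, then the positional shell args.
INSTALL_K3S_EXEC="--flannel-backend=none --disable-network-policy"
SH_ARGS="server --cluster-init"   # from: curl ... | sh -s - server --cluster-init
echo k3s server $INSTALL_K3S_EXEC $SH_ARGS
# -> k3s server --flannel-backend=none --disable-network-policy server --cluster-init
```

The second "server" token lands after the flags from INSTALL_K3S_EXEC, which is exactly the duplicated argument visible in the Scenario 1 units above.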

@itz-Jana
Author

Oh okay, interesting.
I knew that these args were somehow interchangeable, but I didn't quite understand how that worked.
So just to confirm, valid constructs would be either:

export INSTALL_K3S_EXEC="--cluster-cidr=10.42.0.0/16,2001:cafe:42:0::/56 --service-cidr=10.43.0.0/16,2001:cafe:42:1::/112 --disable traefik --disable servicelb --flannel-backend=none --disable-network-policy --server https://10.0.0.13:6443"
curl -sfL https://get.k3s.io | sh -s -

or

curl -sfL https://get.k3s.io | sh -s - server --cluster-cidr=10.42.0.0/16,2001:cafe:42:0::/56 --service-cidr=10.43.0.0/16,2001:cafe:42:1::/112 --disable traefik --disable servicelb --flannel-backend=none --disable-network-policy --server https://10.0.0.13:6443

Correct?

And if I wanted to properly fix my unit files without reinstalling, I would just remove the extra 'server' argument?
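For reference, a cautious sketch of that cleanup, exercised here on a throwaway copy of the unit rather than the live file (the real unit written by the installer is /etc/systemd/system/k3s.service, and after editing it you would run systemctl daemon-reload followed by systemctl restart k3s):

```shell
# Sketch only: delete the stray quoted 'server' argument from a scratch copy
# of the unit file. The bare 'server' subcommand line is left untouched.
UNIT=$(mktemp)
cat > "$UNIT" <<'EOF'
ExecStart=/usr/local/bin/k3s \
    server \
        '--disable-network-policy' \
        'server' \
        '--cluster-init' \
EOF
# Match only lines that consist of the quoted 'server' argument plus the
# trailing line-continuation backslash.
sed -i "/^[[:space:]]*'server' \\\\$/d" "$UNIT"
cat "$UNIT"
```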
