Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bootstrap of HA cluster fails with when agent bootstrap token used as server token #7313

Closed
discordianfish opened this issue Apr 19, 2023 · 8 comments
Assignees
Milestone

Comments

@discordianfish
Copy link
Contributor

Environmental Info:
K3s Version: v1.26.4-rc1+k3s1

Node(s) CPU architecture, OS, and Version:

Linux ip-10-10-190-56 5.10.0-21-cloud-arm64 #1 SMP Debian 5.10.162-1 (2023-01-21) aarch64 GNU/Linux (AWS Graviton)

Cluster Configuration:
2 servers, 1 agent

Describe the bug:
When starting the servers using a mysql (maria) storage endpoint with a empty database, bootstrapping fails with:

Apr 19 10:28:58 ip-10-10-190-56 k3s[5498]: time="2023-04-19T10:28:58Z" level=info msg="Starting k3s v1.26.4-rc1+k3s1 (257fa2c5)"
Apr 19 10:28:58 ip-10-10-190-56 k3s[5498]: time="2023-04-19T10:28:58Z" level=info msg="Configuring mysql database connection pooling: maxIdleConns=2, maxOpenConns=0, connMaxLifetime=0s"
Apr 19 10:28:58 ip-10-10-190-56 k3s[5498]: time="2023-04-19T10:28:58Z" level=info msg="Configuring database table schema and indexes, this may take a moment..."
Apr 19 10:28:58 ip-10-10-190-56 k3s[5498]: time="2023-04-19T10:28:58Z" level=info msg="Database tables and indexes are up to date"
Apr 19 10:28:58 ip-10-10-190-56 k3s[5498]: time="2023-04-19T10:28:58Z" level=info msg="Kine available at unix://kine.sock"
Apr 19 10:28:58 ip-10-10-190-56 k3s[5498]: time="2023-04-19T10:28:58Z" level=warning msg="bootstrap data encrypted with empty string, deleting and resaving with token"
Apr 19 10:28:58 ip-10-10-190-56 k3s[5498]: time="2023-04-19T10:28:58Z" level=fatal msg="starting kubernetes: preparing server: invalid cipher text, not : delimited"
Apr 19 10:28:58 ip-10-10-190-56 systemd[1]: k3s.service: Main process exited, code=exited, status=1/FAILURE

Steps To Reproduce:

  • Installed server like this:
    curl -sfL https://get.k3s.io \
      | INSTALL_K3S_VERSION=v1.26.4-rc1+k3s1 K3S_TOKEN=${k3s_token} \
      sh -s - server \
        --disable traefik \
        --disable-helm-controller \
        --kubelet-arg containerd=/run/k3s/containerd/containerd.sock \
        --datastore-endpoint="${datastore_endpoint}" \
        --tls-san="k8s.example.com"

...where k3s_token is generated by this TF snippet:

resource "random_password" "k3s_token_id" {
  length  = 6
  upper   = false
  special = false
}

resource "random_password" "k3s_token_secret" {
  length  = 16
  upper   = false
  special = false
}

locals {
  k3s_token = "${random_password.k3s_token_id.result}.${random_password.k3s_token_secret.result}"
}

and datastore_endpoint is: "mysql://${aws_db_instance.k3s.username}:${random_password.k3s_password.result}@tcp(${aws_db_instance.k3s.endpoint})/${aws_db_instance.k3s.db_name}"

Expected behavior:
Bootstrap working

Actual behavior:
Bootstrap not working

Additional context / logs:
Here is the db content. It looks like something created the bootstrap key with NULL value. Probably make sense to make this field non null to catch the error earlier:

MariaDB [k3s]> select * from kine;
+----+-------------------------+---------+---------+-----------------+---------------+-------+-------------------+-----------+
| id | name                    | created | deleted | create_revision | prev_revision | lease | value             | old_value |
+----+-------------------------+---------+---------+-----------------+---------------+-------+-------------------+-----------+
|  1 | compact_rev_key         |       1 |       0 |               0 |             0 |     0 |                   | NULL      |
|  2 | /registry/health        |       1 |       0 |               0 |             0 |     0 | {"health":"true"} | NULL      |
|  3 | /bootstrap/e3b0c44298fc |       1 |       0 |               0 |             0 |     0 | NULL              | NULL      |
+----+-------------------------+---------+---------+-----------------+---------------+-------+-------------------+-----------+
3 rows in set (0.001 sec)
@discordianfish
Copy link
Contributor Author

Here is the full log for server a:

Apr 19 10:05:29 ip-10-10-190-56 k3s[1230]: time="2023-04-19T10:05:29Z" level=info msg="Acquiring lock file /var/lib/rancher/k3s/data/.lock"
Apr 19 10:05:29 ip-10-10-190-56 k3s[1230]: time="2023-04-19T10:05:29Z" level=info msg="Preparing data dir /var/lib/rancher/k3s/data/6aec120e8549d8021bfce4c67acfa251e1513537675da4f56b6c68a477df1ed6"
Apr 19 10:05:30 ip-10-10-190-56 k3s[1230]: time="2023-04-19T10:05:30Z" level=info msg="Starting k3s v1.26.4-rc1+k3s1 (257fa2c5)"
Apr 19 10:05:31 ip-10-10-190-56 k3s[1230]: time="2023-04-19T10:05:31Z" level=info msg="Configuring mysql database connection pooling: maxIdleConns=2, maxOpenConns=0, connMaxLifetime=0s"
Apr 19 10:05:31 ip-10-10-190-56 k3s[1230]: time="2023-04-19T10:05:31Z" level=info msg="Configuring database table schema and indexes, this may take a moment..."
Apr 19 10:05:31 ip-10-10-190-56 k3s[1230]: time="2023-04-19T10:05:31Z" level=info msg="Database tables and indexes are up to date"
Apr 19 10:05:31 ip-10-10-190-56 k3s[1230]: time="2023-04-19T10:05:31Z" level=info msg="Kine available at unix://kine.sock"
Apr 19 10:05:31 ip-10-10-190-56 k3s[1230]: time="2023-04-19T10:05:31Z" level=info msg="Bootstrap key locked for initial create"
Apr 19 10:05:31 ip-10-10-190-56 k3s[1230]: time="2023-04-19T10:05:31Z" level=info msg="generated self-signed CA certificate CN=k3s-client-ca@1681898731: notBefore=2023-04-19 10:05:31.157779629 +0000 UTC notAfter=2033-04-16 10:05:31.157779629 +0000 UTC"
Apr 19 10:05:31 ip-10-10-190-56 k3s[1230]: time="2023-04-19T10:05:31Z" level=info msg="certificate CN=system:admin,O=system:masters signed by CN=k3s-client-ca@1681898731: notBefore=2023-04-19 10:05:31 +0000 UTC notAfter=2024-04-18 10:05:31 +0000 UTC"
Apr 19 10:05:31 ip-10-10-190-56 k3s[1230]: time="2023-04-19T10:05:31Z" level=info msg="certificate CN=system:kube-controller-manager signed by CN=k3s-client-ca@1681898731: notBefore=2023-04-19 10:05:31 +0000 UTC notAfter=2024-04-18 10:05:31 +0000 UTC"
Apr 19 10:05:31 ip-10-10-190-56 k3s[1230]: time="2023-04-19T10:05:31Z" level=info msg="certificate CN=system:kube-scheduler signed by CN=k3s-client-ca@1681898731: notBefore=2023-04-19 10:05:31 +0000 UTC notAfter=2024-04-18 10:05:31 +0000 UTC"
Apr 19 10:05:31 ip-10-10-190-56 k3s[1230]: time="2023-04-19T10:05:31Z" level=info msg="certificate CN=system:apiserver,O=system:masters signed by CN=k3s-client-ca@1681898731: notBefore=2023-04-19 10:05:31 +0000 UTC notAfter=2024-04-18 10:05:31 +0000 UTC"
Apr 19 10:05:31 ip-10-10-190-56 k3s[1230]: time="2023-04-19T10:05:31Z" level=info msg="certificate CN=system:kube-proxy signed by CN=k3s-client-ca@1681898731: notBefore=2023-04-19 10:05:31 +0000 UTC notAfter=2024-04-18 10:05:31 +0000 UTC"
Apr 19 10:05:31 ip-10-10-190-56 k3s[1230]: time="2023-04-19T10:05:31Z" level=info msg="certificate CN=system:k3s-controller signed by CN=k3s-client-ca@1681898731: notBefore=2023-04-19 10:05:31 +0000 UTC notAfter=2024-04-18 10:05:31 +0000 UTC"
Apr 19 10:05:31 ip-10-10-190-56 k3s[1230]: time="2023-04-19T10:05:31Z" level=info msg="certificate CN=k3s-cloud-controller-manager signed by CN=k3s-client-ca@1681898731: notBefore=2023-04-19 10:05:31 +0000 UTC notAfter=2024-04-18 10:05:31 +0000 UTC"
Apr 19 10:05:31 ip-10-10-190-56 k3s[1230]: time="2023-04-19T10:05:31Z" level=info msg="generated self-signed CA certificate CN=k3s-server-ca@1681898731: notBefore=2023-04-19 10:05:31.163753133 +0000 UTC notAfter=2033-04-16 10:05:31.163753133 +0000 UTC"
Apr 19 10:05:31 ip-10-10-190-56 k3s[1230]: time="2023-04-19T10:05:31Z" level=info msg="certificate CN=kube-apiserver signed by CN=k3s-server-ca@1681898731: notBefore=2023-04-19 10:05:31 +0000 UTC notAfter=2024-04-18 10:05:31 +0000 UTC"
Apr 19 10:05:31 ip-10-10-190-56 k3s[1230]: time="2023-04-19T10:05:31Z" level=info msg="generated self-signed CA certificate CN=k3s-request-header-ca@1681898731: notBefore=2023-04-19 10:05:31.165113387 +0000 UTC notAfter=2033-04-16 10:05:31.165113387 +0000 UTC"
Apr 19 10:05:31 ip-10-10-190-56 k3s[1230]: time="2023-04-19T10:05:31Z" level=info msg="certificate CN=system:auth-proxy signed by CN=k3s-request-header-ca@1681898731: notBefore=2023-04-19 10:05:31 +0000 UTC notAfter=2024-04-18 10:05:31 +0000 UTC"
Apr 19 10:05:31 ip-10-10-190-56 k3s[1230]: time="2023-04-19T10:05:31Z" level=info msg="generated self-signed CA certificate CN=etcd-server-ca@1681898731: notBefore=2023-04-19 10:05:31.166594568 +0000 UTC notAfter=2033-04-16 10:05:31.166594568 +0000 UTC"
Apr 19 10:05:31 ip-10-10-190-56 k3s[1230]: time="2023-04-19T10:05:31Z" level=info msg="certificate CN=etcd-server signed by CN=etcd-server-ca@1681898731: notBefore=2023-04-19 10:05:31 +0000 UTC notAfter=2024-04-18 10:05:31 +0000 UTC"
Apr 19 10:05:31 ip-10-10-190-56 k3s[1230]: time="2023-04-19T10:05:31Z" level=info msg="certificate CN=etcd-client signed by CN=etcd-server-ca@1681898731: notBefore=2023-04-19 10:05:31 +0000 UTC notAfter=2024-04-18 10:05:31 +0000 UTC"
Apr 19 10:05:31 ip-10-10-190-56 k3s[1230]: time="2023-04-19T10:05:31Z" level=info msg="generated self-signed CA certificate CN=etcd-peer-ca@1681898731: notBefore=2023-04-19 10:05:31.168594877 +0000 UTC notAfter=2033-04-16 10:05:31.168594877 +0000 UTC"
Apr 19 10:05:31 ip-10-10-190-56 k3s[1230]: time="2023-04-19T10:05:31Z" level=info msg="certificate CN=etcd-peer signed by CN=etcd-peer-ca@1681898731: notBefore=2023-04-19 10:05:31 +0000 UTC notAfter=2024-04-18 10:05:31 +0000 UTC"
Apr 19 10:05:31 ip-10-10-190-56 k3s[1230]: time="2023-04-19T10:05:31Z" level=info msg="Saving cluster bootstrap data to datastore"
Apr 19 10:05:31 ip-10-10-190-56 k3s[1230]: time="2023-04-19T10:05:31Z" level=info msg="certificate CN=k3s,O=k3s signed by CN=k3s-server-ca@1681898731: notBefore=2023-04-19 10:05:31 +0000 UTC notAfter=2024-04-18 10:05:31 +0000 UTC"
Apr 19 10:05:31 ip-10-10-190-56 k3s[1230]: time="2023-04-19T10:05:31Z" level=warning msg="dynamiclistener [::]:6443: no cached certificate available for preload - deferring certificate load until storage initialization or first client request"
Apr 19 10:05:31 ip-10-10-190-56 k3s[1230]: time="2023-04-19T10:05:31Z" level=info msg="Active TLS secret / (ver=) (count 11): map[listener.cattle.io/cn-10.10.190.56:10.10.190.56 listener.cattle.io/cn-10.43.0.1:10.43.0.1 listener.cattle.io/cn-127.0.0.1:127.0.0.1 listener.cattle.io/cn-__1-f16284:::1 listener.cattle.io/cn-ip-10-10-190-56:ip-10-10-190-56 list>
Apr 19 10:05:31 ip-10-10-190-56 k3s[1230]: time="2023-04-19T10:05:31Z" level=warning msg="bootstrap data encrypted with empty string, deleting and resaving with token"
Apr 19 10:05:31 ip-10-10-190-56 k3s[1230]: time="2023-04-19T10:05:31Z" level=fatal msg="starting kubernetes: preparing server: invalid cipher text, not : delimited"

And server b:

Apr 19 10:05:29 ip-10-10-159-47 k3s[1235]: time="2023-04-19T10:05:29Z" level=info msg="Acquiring lock file /var/lib/rancher/k3s/data/.lock"
Apr 19 10:05:29 ip-10-10-159-47 k3s[1235]: time="2023-04-19T10:05:29Z" level=info msg="Preparing data dir /var/lib/rancher/k3s/data/6aec120e8549d8021bfce4c67acfa251e1513537675da4f56b6c68a477df1ed6"
Apr 19 10:05:31 ip-10-10-159-47 k3s[1235]: time="2023-04-19T10:05:31Z" level=info msg="Starting k3s v1.26.4-rc1+k3s1 (257fa2c5)"
Apr 19 10:05:31 ip-10-10-159-47 k3s[1235]: time="2023-04-19T10:05:31Z" level=info msg="Configuring mysql database connection pooling: maxIdleConns=2, maxOpenConns=0, connMaxLifetime=0s"
Apr 19 10:05:31 ip-10-10-159-47 k3s[1235]: time="2023-04-19T10:05:31Z" level=info msg="Configuring database table schema and indexes, this may take a moment..."
Apr 19 10:05:31 ip-10-10-159-47 k3s[1235]: time="2023-04-19T10:05:31Z" level=info msg="Database tables and indexes are up to date"
Apr 19 10:05:31 ip-10-10-159-47 k3s[1235]: time="2023-04-19T10:05:31Z" level=info msg="Kine available at unix://kine.sock"
Apr 19 10:05:31 ip-10-10-159-47 k3s[1235]: time="2023-04-19T10:05:31Z" level=warning msg="bootstrap data encrypted with empty string, deleting and resaving with token"
Apr 19 10:05:31 ip-10-10-159-47 k3s[1235]: time="2023-04-19T10:05:31Z" level=fatal msg="starting kubernetes: preparing server: invalid cipher text, not : delimited"

Maybe another race condition? That both logged bootstrap data encrypted with empty string, deleting and resaving with token seems suspicious...

@discordianfish
Copy link
Contributor Author

I think #7215 just isn't working and now instead of servers coming up overwriting the certs it just stuck in this condition.
Also note how it's logging "bootstrap data encrypted with empty string, deleting and resaving with token" but neither logrus.Info("Bootstrap key lock is held") nor "created bootstrap key", making me believe that doMigrateToken() was called but returned early here: https://github.com/k3s-io/k3s/pull/7215/files#diff-be0b49e298e5df44844d727540c02782c68df28c16fbb3da4fe3b4955bbb7591R306

Was the locking mechanism tested with 'old tokens'? I assume 'old tokens' refers to 'short' tokens? https://docs.k3s.io/cli/token#short

@discordianfish
Copy link
Contributor Author

Doesn't even bootstrap with a single server:

  • Scaled down to one
  • Stopped k3s
  • Truncated kine table
  • Terminated instance
    -> Clean slate with fresh instance and empty db
  • Same problem

@discordianfish
Copy link
Contributor Author

Seem this is a regression. Went back to v1.26.3+k3s1 which works (well, as long as I only run one initial server)

@brandond
Copy link
Contributor

brandond commented Apr 19, 2023

It looks like for some reason you're using a kubeadm-style bootstrap token for the server token: an id and secret joined by a period. Don't do that, you are likely confusing the token parser. We do support kubeadm style bootstrap tokens - but not for servers, as covered in the docs.

Just use a random alphanumeric string of arbitrary length for the server token.

@discordianfish
Copy link
Contributor Author

@brandond hrmm I got this from the docs here: https://docs.k3s.io/cli/token#k3s-token-generate

You don't have to use this command in order to generate a token. You can do so yourself as long as it is in the format "[a-z0-9]{6}.[a-z0-9]{16}", where the first portion is the token ID, and the second portion is the secret.

I guess there are not bootstrap but join tokens? Some clarification here https://docs.k3s.io/datastore/ha would be great!

@brandond
Copy link
Contributor

See the docs at https://docs.k3s.io/cli/token#token-types which describes the three token types, and says:

K3s supports dynamically generated, automatically expiring agent bootstrap tokens. Bootstrap tokens can only be used to join agents.

The format for bootstrap type tokens is subsequently documented, which appears to be where you got the inspiration for your token generation template, but you're attempting to use it as a server token which is not supported.

the format "[a-z0-9]{6}.[a-z0-9]{16}", where the first portion is the token ID, and the second portion is the secret.

I'll leave this issue open to track having a better error message when bootstrap type tokens are used as the server token. In the mean time, you can probably get your existing TF working if you just remove the . from the middle of the token, so that it does not match the format expected by the bootstrap token parser.

@VestigeJ
Copy link

##Environment Details
Baselined existing behavior with a bad token using VERSION=v1.27.1+k3s1
Verified new behavior running COMMIT=

Infrastructure

  • Cloud

Node(s) CPU architecture, OS, and version:

Linux 5.14.21-150400.24.11-default x86_64 GNU/Linux 
PRETTY_NAME="SUSE Linux Enterprise Server 15 SP4"

Cluster Configuration:

Config.yaml:

write-kubeconfig-mode: 644
debug: true
token: grygel.1mf673llhuiin9xo
protect-kernel-defaults: true
cluster-init: true

Existing behavior generates a lot of logs with bad certificates and TLS handshake errors when using a malformed token

$ curl https://get.k3s.io --output install-"k3s".sh
$ sudo chmod +x install-"k3s".sh
$ sudo groupadd --system etcd && sudo useradd -s /sbin/nologin --system -g etcd etcd
$ sudo modprobe ip_vs_rr
$ sudo modprobe ip_vs_wrr
$ sudo modprobe ip_vs_sh
$ sudo printf "on_oovm.panic_on_oom=0 \nvm.overcommit_memory=1 \nkernel.panic=10 \nkernel.panic_ps=1 \nkernel.panic_on_oops=1 \n" > ~/90-kubelet.conf
$ sudo cp 90-kubelet.conf /etc/sysctl.d/
$ sudo systemctl restart systemd-sysctl
$ sudo INSTALL_K3S_VERSION=v1.27.1+k3s1 INSTALL_K3S_SKIP_ENABLE=true INSTALL_K3S_TYPE=server ./install-k3s.sh 
$ k3s token generate //paste generated token into the /etc/rancher/k3s/config.yaml like above
$ sudo k3s server //observe errors to stdout 
$ get_report //generate this template

Results:

Cluster node rightfully fails to startup with a malformed token but the reason isn't obvious

INFO[0036] Cluster-Http-Server 2023/05/10 18:53:41 http: TLS handshake error from 13.58.182.128:58480: remote error: tls: bad certificate 
INFO[0036] Cluster-Http-Server 2023/05/10 18:53:41 http: TLS handshake error from 13.58.182.128:58508: remote error: tls: bad certificate 
ERRO[0036] Failed to authenticate request from 13.58.182.128:58520: invalid bearer token 

Confirmed better error surfaces to the use with the malformed token

Validation Steps

$ curl https://get.k3s.io --output install-"k3s".sh
$ sudo chmod +x install-"k3s".sh
$ sudo groupadd --system etcd && sudo useradd -s /sbin/nologin --system -g etcd etcd
$ sudo modprobe ip_vs_rr
$ sudo modprobe ip_vs_wrr
$ sudo modprobe ip_vs_sh
$ sudo printf "on_oovm.panic_on_oom=0 \nvm.overcommit_memory=1 \nkernel.panic=10 \nkernel.panic_ps=1 \nkernel.panic_on_oops=1 \n" > ~/90-kubelet.conf
$ sudo cp 90-kubelet.conf /etc/sysctl.d/
$ sudo systemctl restart systemd-sysctl
$ sudo INSTALL_K3S_COMMIT=b4bc57d049668880d438c257491d15950f396f34 INSTALL_K3S_SKIP_ENABLE=true INSTALL_K3S_TYPE=server ./install-k3s.sh  
$ k3s token generate //paste generated token into the /etc/rancher/k3s/config.yaml like above before starting k3s
$ sudo k3s server //observe errors to stdout 
$ get_report //generate this template

Results:

Expect to observe the following output to stdout

failed to normalize server token; must be in format K10<CA-HASH>::<USERNAME>:<PASSWORD> or <PASSWORD> 

from journalctl -u k3s

FATA[0008] failed to normalize server token; must be in format K10<CA-HASH>::<USERNAME>:<PASSWORD> or <PASSWORD>

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Archived in project
Development

No branches or pull requests

3 participants