
Update from version 1.24.10 to 1.25.7 results in Failed to save TLS secret for kube-system/k3s-serving #7123

Closed
dcarrion87 opened this issue Mar 20, 2023 · 10 comments

Comments


dcarrion87 commented Mar 20, 2023

Environmental Info:
K3s Version: upgrading from 1.24.10 to 1.25.7

Node(s) CPU architecture, OS, and Version:

Linux ip-10-X-X-X 5.15.0-1026-aws #30-Ubuntu SMP Wed Nov 23 17:01:09 UTC 2022 aarch64 aarch64 aarch64 GNU/Linux

Cluster Configuration:

2 servers, 1 agent

Describe the bug:

We're testing an upgrade from 1.24.10 to 1.25.7 on the servers. After the upgrade, these errors keep repeating in the logs:

Mar 20 22:46:23 ip-10-X-X-X k3s[3915]: time="2023-03-20T22:46:23Z" level=error msg="Failed to save TLS secret for kube-system/k3s-serving: Operation cannot be fulfilled on secrets \"k3s-serving\": the object has been modified; please apply your changes to the latest version and >
Mar 20 22:46:22 ip-10-X-X-X k3s[3915]: time="2023-03-20T22:46:22Z" level=info msg="Updating TLS secret for kube-system/k3s-serving (count: 24): map[listener.cattle.io/cn-10.X.X.X:10.X.X.X listener.cattle.io/cn-10.X.X.X:10.X.X.X listener.cattle.io/cn-10.X.X.X:10.X.X.X>
Mar 20 22:46:22 ip-10-X-X-X k3s[3915]: time="2023-03-20T22:46:22Z" level=info msg="Updating TLS secret for kube-system/k3s-serving (count: 24): map[listener.cattle.io/cn-10.X.X.X:10.X.X.X listener.cattle.io/cn-10.X.X.X:10.X.X.X listener.cattle.io/cn-10.X.X.X:10.X.X.X>
Mar 20 22:46:21 ip-10-X-X-X k3s[3915]: time="2023-03-20T22:46:21Z" level=info msg="Updating TLS secret for kube-system/k3s-serving (count: 24): map[listener.cattle.io/cn-10.X.X.X:10.X.X.X listener.cattle.io/cn-10.X.X.X:10.X.X.X listener.cattle.io/cn-10.X.X.X:10.X.X.X>
Mar 20 22:46:21 ip-10-X-X-X k3s[3915]: time="2023-03-20T22:46:21Z" level=info msg="Updating TLS secret for kube-system/k3s-serving (count: 24): map[listener.cattle.io/cn-10.X.X.X:10.X.X.X listener.cattle.io/cn-10.X.X.X:10.X.X.X listener.cattle.io/cn-10.X.X.X:10.X.X.X>
Mar 20 22:46:21 ip-10-X-X-X k3s[3915]: time="2023-03-20T22:46:21Z" level=info msg="Updating TLS secret for kube-system/k3s-serving (count: 24): map[listener.cattle.io/cn-10.X.X.X:10.X.X.X listener.cattle.io/cn-10.X.X.X:10.X.X.X listener.cattle.io/cn-10.X.X.X:10.X.X.X>

Steps To Reproduce:

  • Updated k3s binaries on server 1
  • Updated k3s binaries on server 2
  • Checked journalctl logs

brandond commented Mar 20, 2023

We're testing upgrade between 1.24.10 and 1.25.10 on the servers

Can you confirm the versions that you are upgrading from and to? There is no v1.25.10 yet; the latest GA 1.25 release available is v1.25.7+k3s1 - which is the version you mentioned elsewhere.

Mar 20 22:46:21 ip-10-X-X-X k3s[3915]: time="2023-03-20T22:46:21Z" level=info msg="Updating TLS secret for kube-system/k3s-serving (count: 24): map[listener.cattle.io/cn-10.X.X.X:10.X.X.X listener.cattle.io/cn-10.X.X.X:10.X.X.X listener.cattle.io/cn-10.X.X.X:10.X.X.X>

The logs are truncated at the end of the line; can you attach the complete logs from journald without any terminal-width truncation?
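For reference, something like this captures the full lines without terminal-width truncation (assuming k3s is running under the standard k3s systemd unit; the output path is just an example):

journalctl -u k3s --no-pager > /tmp/k3s.log
# or follow just the TLS secret messages live:
journalctl -u k3s -f --no-pager | grep "TLS secret for kube-system/k3s-serving"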

2 servers, 1 agent

What are you using for the datastore on your two server nodes?

Have you upgraded both servers, or just one?


dcarrion87 commented Mar 20, 2023

  • 1.24.10 to 1.25.7
  • Logs re-attached:
Mar 20 23:00:51 ip-X-X-4-252 k3s[3915]: time="2023-03-20T23:00:51Z" level=error msg="Failed to save TLS secret for kube-system/k3s-serving: Operation cannot be fulfilled on secrets \"k3s-serving\": the object has been modified; please apply your changes to the latest version and try again"
Mar 20 23:00:51 ip-X-X-4-252 k3s[3915]: time="2023-03-20T23:00:51Z" level=info msg="Updating TLS secret for kube-system/k3s-serving (count: 24): map[listener.cattle.io/cn-X.X.4.221:X.X.4.221 listener.cattle.io/cn-X.X.4.252:X.X.4.252 listener.cattle.io/cn-X.X.5.109:X.X.5.109listener.cattle.io/cn-X.X.5.194:X.X.5.194 listener.cattle.io/cn-X.X.5.69:X.X.5.69 listener.cattle.io/cn-X.X.5.84:X.X.5.84 listener.cattle.io/cn-X.X.6.32:X.X.6.32 listener.cattle.io/cn-127.0.0.1:127.0.0.1 listener.cattle.io/cn-172.17.0.1:172.17.0.1 listener.cattle.io/cn-__1-f16284:::1 listener.cattle.io/cn-host.docker.internal:host.docker.internal listener.cattle.io/cn-ip-X-X-4-221:ip-X-X-4-221 listener.cattle.io/cn-ip-X-X-4-252:ip-X-X-4-252 listener.cattle.io/cn-ip-X-X-5-109:ip-X-X-5-109 listener.cattle.io/cn-ip-X-X-5-194:ip-X-X-5-194 listener.cattle.io/cn-ip-X-X-5-69:ip-X-X-5-69 listener.cattle.io/cn-ip-X-X-5-84:ip-X-X-5-84 listener.cattle.io/cn-ip-X-X-6-32:ip-X-X-6-32 listener.cattle.io/cn-kubernetes:kubernetes listener.cattle.io/cn-kubernetes.default:kubernetes.default listener.cattle.io/cn-kubernetes.default.svc:kubernetes.default.svc listener.cattle.io/cn-kubernetes.default.svc.cluster.local:kubernetes.default.svc.cluster.local listener.cattle.io/cn-localhost:localhost listener.cattle.io/cn-REDACTED-57fe667ded5ca-319b95:REDACTED-57fe667ded5caa94.elb.ap-southeast-2.amazonaws.com listener.cattle.io/fingerprint:SHA1=A4F0DBA2A022067940113F2F8B2174F53F70BE63]"
  • Datastore is Postgres
  • Both servers have been upgraded
  • I have just tried restarting both again now and it's still throwing that error endlessly (seconds apart).

@brandond

Can you confirm that the CLI flags on both servers (the tls-san values in particular) are in sync?

Have you tried stopping one of the servers for a period of time so that the other can start up and successfully update the secret?

Because you've posted just a single redacted log line from a single server, I can't really tell what it's trying to change on the cert. Can you compare the logged certificate annotations between the two servers to see what they are alternately trying to set?
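For example (just a sketch, assuming the standard k3s systemd unit; the server1/server2 filenames are placeholders), grab the most recent annotation map on each server and diff the two:

# on each server:
journalctl -u k3s --no-pager | grep "Updating TLS secret for kube-system/k3s-serving" | tail -n 1 > /tmp/k3s-serving-$(hostname).txt
# then, with both files copied to one machine:
diff /tmp/k3s-serving-server1.txt /tmp/k3s-serving-server2.txt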


dcarrion87 commented Mar 20, 2023

Is that something we should be accommodating when upgrading the servers?

Only one server was on the new version for about 5 minutes during the upgrade. I thought it was supported to have all servers running during the upgrade?

I appreciate it's difficult with redacted values. I can confirm the tls-san values are in sync. Config below.

#  -----  Ansible Managed  -----  #

datastore-endpoint: postgres://REDACTED
token: REDACTED
tls-san: REDACTED.elb.ap-southeast-2.amazonaws.com
agent-token: REDACTED
etcd-disable-snapshots: true
cluster-cidr: 172.16.0.0/16
service-cidr: 172.17.0.0/16
flannel-backend: none
disable-network-policy: true
write-kubeconfig: /root/.kube/config
disable: traefik
node-taint: CriticalAddonsOnly=true:NoExecute
kube-scheduler-arg:
- "config=/etc/rancher/k3s/scheduler-config.yaml"
kube-apiserver-arg:
- "enable-admission-plugins=AlwaysPullImages"
- "service-account-jwks-uri=https://REDACTED.s3.amazonaws.com/REDACTED-cluster/openid/v1/jwks"
- "service-account-issuer=https://REDACTED.s3.amazonaws.com/REDACTED-cluster"


dcarrion87 commented Mar 21, 2023

@brandond could you point me at any docs that talk about what the k3s-serving secret is and what it's actually trying to do?

@dcarrion87

How odd, it's now stopped... Restarted k3s a few more times and it's fine...


brandond commented Mar 21, 2023

There aren't any docs about this specific implementation detail, but that secret is used to store the dynamically-generated server certificate for the apiserver/supervisor listener on port 6443. The certificate is updated with SAN entries for any requested names and addresses, as well as any names or addresses requested by clients. The observed behavior suggests that there were some hostnames or addresses that both nodes were attempting to add, but were unable to due to recurring conflicts.

It's hard to tell specifically what they were conflicting on without looking at full unredacted logs.
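If you want to look at the current state yourself, something like the following shows the listener annotations on the secret and the SANs on the certificate actually served on 6443 (a sketch; it assumes kubectl access to the cluster and openssl on the node):

kubectl get secret -n kube-system k3s-serving -o yaml | grep "listener.cattle.io"
echo | openssl s_client -connect 127.0.0.1:6443 2>/dev/null | openssl x509 -noout -text | grep -A1 "Subject Alternative Name"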

The upgrade was done with only one server with that version for 5 minutes. I thought it was supported to have all server running during upgrade?

Yes, that's fine. We haven't really touched anything in this space in quite a while, so I'm not sure what exactly would be causing this. I suggested stopping one of the servers for a bit because that would break the conflict cycle and allow the other node to make whatever changes it's trying to make.
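Concretely that would just be something like the following, run on one server only while the other stays up (assuming the standard k3s systemd service):

systemctl stop k3s
# give the other server a few minutes and watch its logs settle, then:
systemctl start k3s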

@caroline-suse-rancher

Closing since the problem resolved itself. Please reopen if it re-emerges.


oivindoh commented Jul 5, 2023

In case I stumble upon this again: I also hit this issue in a four-node cluster, with all the nodes trying to apply seemingly identical updates after two nodes had been restarted on 1.25.10. I eventually decided to manually upgrade k3s on the remaining nodes by replacing the binary and restarting them. This seemed to clear it up.
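(For future me, the manual swap was roughly the following on each remaining node, one at a time; this assumes the default /usr/local/bin/k3s install path and an arm64 node, and the exact release asset/version should be checked against the k3s releases page:)

systemctl stop k3s
curl -Lo /usr/local/bin/k3s "https://github.com/k3s-io/k3s/releases/download/v1.25.10%2Bk3s1/k3s-arm64"
chmod +x /usr/local/bin/k3s
systemctl start k3s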

@sidewinder12s

Also hit this on the 1.24 -> 1.25 upgrade. Once we got all control-plane nodes moved over to 1.25 it went away, so I'm assuming some part of the secret/SAN handling changed between 1.24 and 1.25, with the nodes fighting over it until the last 1.24 node went away.
