
/srv/kubernetes/kubelet-server.crt expired, did not auto-renew #15970

Closed
darintay opened this issue Sep 27, 2023 · 5 comments

Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@darintay
/kind bug

1. What kops version are you running? The command kops version will display
this information.

1.25.4

2. What Kubernetes version are you running? kubectl version will print the
version if a cluster is running or provide the Kubernetes version specified as
a kops flag.

1.23.5

3. What cloud provider are you using?
AWS

4. What commands did you run? What is the simplest way to reproduce this issue?
A node that was running for 400+ days had its kubelet-server.crt certificate expire, which broke all pods on the node.

$ sudo openssl x509 -enddate -noout -in /srv/kubernetes/kubelet-server.crt
notAfter=Sep 27 06:52:10 2023 GMT
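
For reference, the same check looped over every certificate under /srv/kubernetes (paths assumed from a standard kops node layout):

$ for crt in /srv/kubernetes/*.crt; do echo -n "$crt: "; sudo openssl x509 -enddate -noout -in "$crt"; done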

Is something supposed to be auto-renewing this certificate? I couldn't see anything in the kubelet logs about it. I know that kubelets have certificate rotation (https://kubernetes.io/docs/tasks/tls/certificate-rotation/) but I don't know if that is supposed to be covering this file, or if it's something on the kops side.

Not sure if there's an easy way to test/reproduce this due to the duration of these certs.

(I know ideally I'd be doing control plane upgrades frequently enough that this doesn't matter, but I'd like to sort this out for if/when upgrades slip again in the future.)

@k8s-ci-robot k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label Sep 27, 2023
@johngmyers
Member

No, kops expects you to update nodes at least every 455 days.
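
The usual remediation is a rolling update of the cluster before the certificates expire. A minimal sketch, assuming kops credentials and KOPS_STATE_STORE are already configured ($CLUSTER_NAME is a placeholder for your cluster name):

$ kops update cluster --name $CLUSTER_NAME --yes
$ kops rolling-update cluster --name $CLUSTER_NAME --yes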

@darintay
Author

OK, good to at least know that's expected behavior, thanks.

@doryer

doryer commented Dec 3, 2023

@johngmyers isn't rotateCertificates used for this kind of use case?
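
As far as I understand, kubelet's rotateCertificates covers the kubelet client certificate, not the serving cert that kops provisions at /srv/kubernetes/kubelet-server.crt. A generic way to check which rotation flags a node's running kubelet was actually started with:

$ ps aux | grep '[k]ubelet' | tr ' ' '\n' | grep -i rotate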

@aviadhaham

Hey @darintay @johngmyers, following the above:

My company has inherited k8s clusters built and maintained with kOps, and one of them (fortunately, a non-production one) has started having this issue of nodes going offline, seemingly due to the expiration of kubelet-server.crt.

I’m wondering what’s the best approach here?

I noticed you mentioned that a manual kops update is needed at least every 455 days, but what should I do if this cluster wasn't updated in time and some of its nodes already have expired certificates?
When I try to update, the cluster validation step fails because of the problematic nodes (the ones whose certificates expired), whose validation status is returned as below:

node "ip-172-23-45-127.ec2.internal" of role "node" is not ready

We're pretty stuck here, and we have more (production) clusters that we're afraid will hit the same issue and that we won't know how to handle properly.

Thank you in advance!

@schwing

schwing commented Feb 27, 2024

@aviadhaham This may be too late, but I'll comment here in case anyone else runs into this issue.

The --cloudonly flag is necessary when the control plane is down.

If you're attempting to recover without also applying an update, to limit the number of moving parts during recovery, you'll need the --force flag to tell kops to roll the nodes even if no updates are required.

Fixing the control plane first before moving on to nodes is a good idea, so an example of doing that first: kops rolling-update cluster --instance-group-roles master --cloudonly --force --yes. Once the control plane is healthy and the Kubernetes API is working again, do similar for the node instance group role or individual instance groups.

Of course, updating more often in the future to avoid this should be a priority, but this will get things running again so you can focus on updating.
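
Putting that together, the full recovery sequence might look like the following sketch (flags taken from the comment above; adjust the instance-group roles or names to your cluster):

# 1. Roll the control plane first; the API is down, so skip validation with --cloudonly
$ kops rolling-update cluster --instance-group-roles master --cloudonly --force --yes

# 2. Once the Kubernetes API responds again, roll the workers the same way
#    (--cloudonly is still useful here, since the nodes with expired certs will fail validation)
$ kops rolling-update cluster --instance-group-roles node --cloudonly --force --yes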
