Cluster API doesn't directly support certificate renewal #9662

Closed
vishu2498 opened this issue Nov 2, 2023 · 6 comments
Labels
kind/feature: Categorizes issue or PR as related to a new feature.
triage/accepted: Indicates an issue or PR is ready to be actively worked on.

Comments

@vishu2498

What would you like to be added (User Story)?

Cluster API should have a direct role in renewing a cluster's certificates during the cluster's lifecycle. During its reconciliation process, it should check whether the certificates have expired and, if so, automatically renew them.

Detailed Description

The certificates that are generated by kubeadm are valid for only 1 year.

Anyone can check the certificate expiry with this command on a control-plane node of the cluster:

kubeadm certs check-expiration

To renew the certificates in a cluster, kubeadm directly provides this command:

kubeadm certs renew all

However, the problem here is that the user currently has no way of knowing when the certificates are going to expire. Even if the user does find out and renews the certificates themselves, the process is still not complete.

The reason is that after the certificate renewal completes, we are greeted with this message in the output:

Done renewing certificates. You must restart the kube-apiserver, kube-controller-manager, kube-scheduler and etcd, so that they can use the new certificates.

Since these are core Kubernetes components and run as static pods, just restarting the pods won't make them use the new certificates. For the static pods, we have to move their manifest files out of the "/etc/kubernetes/manifests/" directory, wait a few seconds for the pods to be deleted, and then move the files back in. This process has to be done manually, as suggested by the official Kubernetes documentation: https://kubernetes.io/docs/tasks/administer-cluster/kubeadm/kubeadm-certs/#manual-certificate-renewal
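
For illustration only, a minimal sketch of that manual restart on a control-plane node, run after "kubeadm certs renew all" (the backup directory is just an example path):

# moving the manifests out makes the kubelet delete the static pods
mkdir -p /tmp/manifests-backup
mv /etc/kubernetes/manifests/*.yaml /tmp/manifests-backup/
# give the kubelet time to notice the change and remove the pods
sleep 20
# moving the manifests back recreates the pods with the renewed certificates
mv /tmp/manifests-backup/*.yaml /etc/kubernetes/manifests/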

Also, the cluster will have a little downtime until the core Kubernetes components come back up.

Challenges observed in implementing the solution:

  • We may lose cluster access for a while.
  • The certs should not be deleted under any circumstances: per current testing, the "kubeadm" certificate renewal command fails to create new certificates if any certificate is missing.
  • If the process fails while reconciliation is in progress, the code should put the manifests back in place in the next reconcile.
  • While the code performs this operation, there is a chance that the CAPI pod gets deleted in the middle of moving files in and out. If it recovers in between and gets back into a running state, it should continue the process; this means that each file should be handled individually. However, if it goes into a "CrashLoopBackOff" state, the cluster will not be recoverable from that state.

A few people have already written bash scripts or Go code to perform this operation. Here's one reference for a bash script:

cc @snehala27 @sadysnaat

Anything else you would like to add?

No response

Label(s) to be applied

/kind feature
One or more /area label. See https://github.com/kubernetes-sigs/cluster-api/labels?q=area for the list of labels.

@k8s-ci-robot k8s-ci-robot added kind/feature Categorizes issue or PR as related to a new feature. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Nov 2, 2023
@neolit123
Member

neolit123 commented Nov 2, 2023

Cluster API should be having a direct role in renewing the certificates of a cluster in the cluster's lifecycle. In its reconcilation process, it should be checking that if the certificates are expired, it should automatically renew them.

note that this contradicts the original CAPI immutable machine design - i.e. you are not supposed to change a machine to renew its certificates or change its state once it's added. instead, new machines can join the cluster before the certificates expire and old machines can be removed.

that said, there is a new feature group that is about to discuss in-place upgrades in CAPI. and if CAPI uses kubeadm upgrade to do the in-place upgrades, it will benefit from the certificate rotation that kubeadm upgrade does by default.

#9559
#9489
#7415

@killianmuldoon
Contributor

/triage accepted

Let's keep this open for reference so the in-place upgrades folks can discuss it, and they can keep it open or close it as they'd prefer.

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Nov 2, 2023
@fabriziopandini
Member

Isn't this what we are describing in https://cluster-api.sigs.k8s.io/tasks/certs/auto-rotate-certificates-in-kcp?
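
That page describes having KubeadmControlPlane trigger a rollout of new control plane machines before the certificates expire. A minimal sketch of the relevant field, with an illustrative name and threshold (see the linked page for the exact API versions it applies to):

apiVersion: controlplane.cluster.x-k8s.io/v1beta1
kind: KubeadmControlPlane
metadata:
  name: my-control-plane
spec:
  rolloutBefore:
    # roll out new machines when the certificates have fewer than 21 days left
    certificatesExpiryDays: 21
  # ... rest of the KubeadmControlPlane spec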

@oblazek

oblazek commented Apr 4, 2024

Isn't this what we are describing in https://cluster-api.sigs.k8s.io/tasks/certs/auto-rotate-certificates-in-kcp?

Yeah, seems to me it's exactly that.

@fabriziopandini
Member

/close
as per comment above

@k8s-ci-robot
Contributor

@fabriziopandini: Closing this issue.

In response to this:

/close
as per comment above

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
