Create command for rotating cluster CA #10516

olemarkus · 2021-01-02T18:51:21Z

Pretty much following https://kubernetes.io/docs/tasks/tls/manual-rotation-of-ca-certificates/ but in an order that makes sense for kOps

olemarkus · 2021-02-04T08:31:19Z

/milestone v1.21

k8s-ci-robot · 2021-04-04T12:30:38Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
To complete the pull request process, please ask for approval from olemarkus after the PR has been reviewed.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

olemarkus · 2021-04-09T05:34:51Z

/cc @johngmyers @justinsb

johngmyers · 2021-04-10T04:34:31Z

My main concern with this is how the admin is supposed to recover if the command fails halfway through, for example with a power loss on the machine running kops.

I designed the similar WIP PR #9556 with smaller, atomic steps. In that PR, to fully distrust the previous keys one has to run the command three times, waiting enough time for tokens to be renewed between each run.

Like that PR, it might be easier to simplify by always keeping a "next CA" (and possibly "previous CA") in the trust list.

Another idea is to put a hash of the relevant CA set in userdata, so worker and/or control plane nodes will be detected as needing update by the rolling update code. (EDIT: Actually, this probably already happens as a side-effect of the changing config)
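The hash idea could look roughly like the sketch below. `caSetHash` is a hypothetical helper, not a kOps function, and the certificate strings are placeholders; how the digest would actually be embedded in userdata is left open.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sort"
)

// caSetHash returns a stable digest of a set of PEM-encoded CA certificates.
// Hypothetical helper: the name and signature are assumptions for illustration.
func caSetHash(pemCerts []string) string {
	// Sort a copy so the digest does not depend on the order of the input.
	sorted := append([]string(nil), pemCerts...)
	sort.Strings(sorted)

	h := sha256.New()
	for _, c := range sorted {
		// Length-prefix each entry so that concatenation is unambiguous.
		fmt.Fprintf(h, "%d:", len(c))
		h.Write([]byte(c))
	}
	return hex.EncodeToString(h.Sum(nil))
}

func main() {
	oldOnly := caSetHash([]string{"old-ca-pem"})
	both := caSetHash([]string{"new-ca-pem", "old-ca-pem"})
	// A changed digest would change the rendered userdata, which is what
	// would make the rolling-update code treat the node as needing an update.
	fmt.Println("needs update:", oldOnly != both)
}
```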

olemarkus · 2021-04-10T05:11:14Z

We rotate the cluster twice in this case: once with a CA bundle where both certs are trusted, and once with only the new CA, after everything trusts the new CA.

I have not managed to end up in a state where the cluster is broken. I only managed to lock myself out by explicitly exporting admin credentials signed with the new CA too early, before the API servers trusted it.

olemarkus · 2021-04-10T05:22:55Z

@johngmyers I see your PR is about SA signing token keys, which is much messier, as all API servers and all pods should rotate at exactly the same time. There is kubernetes/kubernetes#20165, but it is pretty much unresolved.

johngmyers · 2021-04-10T05:48:41Z

@olemarkus my PR is about graceful rotation of SA signing token keys. It is not necessary for all API servers and pods to rotate at the same time.

Since kubernetes/kubernetes#20165 was filed, the --service-account-key-file flag has started accepting multiple keys.
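As a minimal sketch of what that enables: during a graceful rotation the API server can be given both the old and the new verification key, so tokens signed by either are accepted. `serviceAccountKeyArgs` and the file paths below are hypothetical; only the flag name comes from the comment above.

```go
package main

import "fmt"

// serviceAccountKeyArgs renders one --service-account-key-file argument per key file.
// Hypothetical helper for illustration; this is not how kOps builds its flags.
func serviceAccountKeyArgs(keyFiles []string) []string {
	args := make([]string, 0, len(keyFiles))
	for _, f := range keyFiles {
		args = append(args, "--service-account-key-file="+f)
	}
	return args
}

func main() {
	// Placeholder paths: both keys are trusted for verification during the overlap.
	for _, a := range serviceAccountKeyArgs([]string{"/example/sa-old.pub", "/example/sa-new.pub"}) {
		fmt.Println(a)
	}
}
```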

olemarkus · 2021-04-10T07:47:01Z

Ah, I wasn't aware of that. Thanks. I need to have a look at how that one works then.

johngmyers · 2021-04-12T05:57:26Z

@olemarkus Would you be willing to hold this for a while to give #11204 a chance to land first? I believe that puts down some infrastructure that would improve this PR.

olemarkus · 2021-04-12T09:04:45Z

Yeah. I like the ability to decide when to move to the next key rather than always using the highest number. I think that may require an additional roll of the control plane though, as on the first roll, the kube-controller-manager would not use the new keys. Unless we tell it to always use the latest cert even if it is not primary.

k8s-ci-robot · 2021-04-13T16:13:42Z

@olemarkus: PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

johngmyers · 2021-04-13T20:42:42Z

cmd/kops/rotate_ca.go

+	}
+
+	if !model.UseKopsControllerForNodeBootstrap(cluster) {
+		return fmt.Errorf("only clusters using kops-controller for boostrapping nodes are supported")


This also needs to disallow clusters where !UseEtcdManager && UseEtcdTLS.
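A sketch of the combined precondition check, assuming the three conditions are available as plain booleans. `clusterFeatures` and `validateRotationSupported` are hypothetical stand-ins; in the PR these values would come from kOps model helpers and the cluster spec, and the second error message is invented for illustration.

```go
package main

import (
	"errors"
	"fmt"
)

// clusterFeatures is a hypothetical stand-in for values derived from the cluster spec.
type clusterFeatures struct {
	KopsControllerBootstrap bool
	EtcdManager             bool
	EtcdTLS                 bool
}

// validateRotationSupported rejects cluster configurations the command cannot handle.
func validateRotationSupported(f clusterFeatures) error {
	if !f.KopsControllerBootstrap {
		return errors.New("only clusters using kops-controller for bootstrapping nodes are supported")
	}
	// The extra guard requested in review: legacy etcd with TLS is out of scope.
	if !f.EtcdManager && f.EtcdTLS {
		return errors.New("clusters using etcd TLS without etcd-manager are not supported")
	}
	return nil
}

func main() {
	err := validateRotationSupported(clusterFeatures{KopsControllerBootstrap: true, EtcdTLS: true})
	fmt.Println(err)
}
```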

johngmyers · 2021-04-13T20:54:17Z

cmd/kops/rotate_ca.go

+	}
+
+	//Update service accounts to trust old and new CA
+	err = rotateCAUpdateServiceAccounts(ctx, cluster, caBundle)


This needs to be done after staging the new cert to the control plane. Otherwise, between the time this runs and the time the control plane picks up the new cert, controller-manager could write a new service account secret containing only the old CA.

johngmyers · 2021-04-13T21:12:31Z

cmd/kops/rotate_ca.go

+	//Update masters. This will issue new certs for k8s using the new CA.
+	//New nodes, service accounts etc will use new CA
+	ruo.InstanceGroupRoles = []string{"master", "apiserver"}
+


While the new certs/keys currently take effect without an apply_cluster, this is undesirable and there are plans to move more of this stuff to userdata and/or versioned files in the store. We should ensure that changing either the set of trusted CAs or the active CA causes the control plane nodes to be treated as "NeedsUpdate". We should also allow for a future need to do an apply_cluster.

johngmyers · 2021-04-13T21:12:41Z

cmd/kops/rotate_ca.go

+
+	klog.Info("rotating the control plane")
+
+	//Update masters. This will issue new certs for k8s using the new CA.


Suggested change

-	//Update masters. This will issue new certs for k8s using the new CA.
+	// Update control plane. This will issue new certs for k8s using the new CA.

johngmyers · 2021-04-13T21:16:51Z

cmd/kops/rotate_ca.go

+		return fmt.Errorf("failed to rotate cluster: %v", err)
+	}
+
+	klog.Info("rotating the control plane")


The new certificate-authority-data needs to be propagated to all the kubeconfig files that live outside the cluster before proceeding to change the signing CA. This is part of the reason I don't think this should be an "all in one go" type of command.

johngmyers · 2021-04-13T21:32:10Z

cmd/kops/rotate_ca.go

+		return fmt.Errorf("only clusters using kops-controller for boostrapping nodes are supported")
+	}
+
+	exportAdmin := clientConfig.Contexts[contextName].AuthInfo == contextName


This appears to be a bit too heuristic. It should check that the corresponding AuthInfo has nonempty ClientCertificateData.
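A stricter check along those lines might look like the sketch below. To stay self-contained it uses minimal local types that only mirror the fields referenced in the diff and the comment (`Contexts`, `AuthInfo`, `ClientCertificateData`), plus an assumed `AuthInfos` map; `shouldExportAdmin` is a hypothetical name, and real code would use the client-go kubeconfig types instead.

```go
package main

import "fmt"

// Minimal local stand-ins for the kubeconfig types; assumptions for illustration.
type authInfo struct {
	ClientCertificateData []byte
}

type kubeContext struct {
	AuthInfo string
}

type kubeConfig struct {
	Contexts  map[string]*kubeContext
	AuthInfos map[string]*authInfo
}

// shouldExportAdmin reports whether the named context authenticates with an
// embedded client certificate, rather than inferring that from matching names.
func shouldExportAdmin(cfg *kubeConfig, contextName string) bool {
	ctx, ok := cfg.Contexts[contextName]
	if !ok || ctx == nil {
		return false
	}
	auth, ok := cfg.AuthInfos[ctx.AuthInfo]
	if !ok || auth == nil {
		return false
	}
	return len(auth.ClientCertificateData) > 0
}

func main() {
	cfg := &kubeConfig{
		Contexts:  map[string]*kubeContext{"example": {AuthInfo: "example"}},
		AuthInfos: map[string]*authInfo{"example": {}},
	}
	// Names match, but there is no client certificate to re-issue.
	fmt.Println(shouldExportAdmin(cfg, "example"))
}
```

Looking the context up with the two-value map form also avoids a nil dereference when the context name is missing from the kubeconfig.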

johngmyers · 2021-04-13T21:34:22Z

cmd/kops/rotate_ca.go

+	}
+
+	if len(pool.Secondary) > 0 {
+		klog.Info("Secondary CA cert already in the pool. Not issuing a new CA")


What if the previous run died while removing trust from the old CA?

johngmyers · 2021-04-13T22:50:18Z

Another reason I don't like an all-in-one command: We have a workload (rhymes with CrewBeeper) which frequently fails on a node rolling update. For this reason, we only do node rolling updates with a locally written controller. That controller implements local constraints, such as notifying the relevant workload's maintainers and delaying the update of their instance group until they can be on deck to fix the resulting failure of their workload. So we would not want the command that updates the keys to perform the rolling update.

olemarkus · 2021-06-25T06:47:12Z

/close

Closing in favour of the great work done by John!

k8s-ci-robot · 2021-06-25T06:47:18Z

@olemarkus: Closed this PR.

In response to this:

/close

Closing in favour of the great work done by John!

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Jan 2, 2021

k8s-ci-robot requested review from KashifSaadat and rdrgmnzs January 2, 2021 18:51

k8s-ci-robot added area/nodeup size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jan 2, 2021

This was referenced Jan 11, 2021

WIP Implement rotation of service-account signing key #9556

Closed

Certificate renewal #8774

Closed

Automate rotation of cluster secrets #10498

Closed

k8s-ci-robot added this to the v1.21 milestone Feb 4, 2021

olemarkus force-pushed the rotate-cert branch from eb9143b to 3848dfd Compare March 8, 2021 20:10

olemarkus modified the milestones: v1.21, v1.22 Mar 29, 2021

olemarkus force-pushed the rotate-cert branch from 3848dfd to 35ae159 Compare April 4, 2021 12:30

olemarkus force-pushed the rotate-cert branch from 35ae159 to 208d627 Compare April 4, 2021 12:48

k8s-ci-robot added the area/documentation label Apr 4, 2021

olemarkus modified the milestones: v1.22, v1.21 Apr 4, 2021

olemarkus force-pushed the rotate-cert branch from 208d627 to 966f77c Compare April 4, 2021 16:56

olemarkus changed the title ~~WIP: Create toolbox command for rotating cluster CA~~ Create toolbox command for rotating cluster CA Apr 4, 2021

k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 4, 2021

olemarkus force-pushed the rotate-cert branch from 966f77c to db15e91 Compare April 4, 2021 17:05

k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Apr 4, 2021

Add command for rotating cluster CA

37ecac7

olemarkus force-pushed the rotate-cert branch from db15e91 to 37ecac7 Compare April 4, 2021 17:48

olemarkus changed the title ~~Create toolbox command for rotating cluster CA~~ Create command for rotating cluster CA Apr 4, 2021

k8s-ci-robot requested review from johngmyers and justinsb April 9, 2021 05:34

hakman requested review from hakman and removed request for rdrgmnzs and KashifSaadat April 9, 2021 05:51

hakman added the kind/office-hours label Apr 9, 2021

k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 13, 2021

johngmyers requested changes Apr 13, 2021


k8s-ci-robot assigned johngmyers Apr 13, 2021

johngmyers reviewed Apr 13, 2021


johngmyers mentioned this pull request Apr 13, 2021

WIP Implement rotation of service-account signing key (take two) #11204

Closed

johngmyers removed the kind/office-hours label Apr 23, 2021

olemarkus modified the milestones: v1.21, v1.22 Apr 24, 2021

johngmyers mentioned this pull request May 6, 2021

Quote grep patterns in docs/rotate-secrets.md #10656

Merged

k8s-ci-robot closed this Jun 25, 2021
