-
Notifications
You must be signed in to change notification settings - Fork 4.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Create command for rotating cluster CA #10516
Conversation
/milestone v1.21 |
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
/cc @johngmyers @justinsb |
My main concern with this is how the admin is supposed to recover if the command fails halfway through, for example with a power loss on the machine running kops. I designed the similar WIP PR #9556 with smaller, atomic steps. In that PR, to fully distrust the previous keys one has to run the command three times, waiting enough time for tokens to be renewed between each run. Like that PR, it might be easier to simplify by always keeping a "next CA" (and possibly "previous CA") in the trust list. Another idea is to put a hash of the relevant CA set in userdata, so worker and/or control plane nodes will be detected as needing update by the rolling update code. (EDIT: Actually, this probably already happens as a side-effect of the changing config) |
We rotate the cluster twice in this case, once with a CA bundle where both certs are trusted, and once with only one the new CA after everything trust the new CA. I have not managed to end up in a state where the cluster is broken. Only manage to lock myself out by explicitly exporting admin credentials signed with the new CA too early, before the api servers trust it. |
@johngmyers I see your PR is about SA signing token keys, which is much more messy as all api servers and all pods should rotate at exactly the same time. There is kubernetes/kubernetes#20165 but it is pretty much unresolved |
@olemarkus my PR is about graceful rotation of SA signing token keys. It is not necessary for all API servers and pods to rotate at the same time. Since kubernetes/kubernetes#20165 was filed the |
Ah wasn't aware of that. Thanks. I need to have a look on how that one works then. |
@olemarkus Would you be willing to hold this for a while to give #11204 a chance to land first? I believe that puts down some infrastructure that would improve this PR. |
Yeah. I like the ability to decide when to move to the next key rather than always using the highest number. I think that may require an additional roll of the control plane though, as on the first roll, the kube-controller-manager would not use the new keys. Unless we tell it to always use the latest cert even if it is not primary. |
@olemarkus: PR needs rebase. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
} | ||
|
||
if !model.UseKopsControllerForNodeBootstrap(cluster) { | ||
return fmt.Errorf("only clusters using kops-controller for boostrapping nodes are supported") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also need to disallow for !UseEtcdManager && UseEtcdTLS
} | ||
|
||
//Update service accounts to trust old and new CA | ||
err = rotateCAUpdateServiceAccounts(ctx, cluster, caBundle) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This needs to be done after staging the new cert to the control plane. Otherwise, between the time this runs and the time the control plane picks up the new cert, controller-manager could write a new service account secret containing only the old CA.
//Update masters. This will issue new certs for k8s using the new CA. | ||
//New nodes, service accounts etc will use new CA | ||
ruo.InstanceGroupRoles = []string{"master", "apiserver"} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While the new certs/keys currently take effect without an apply_cluster
, this is undesirable and there are plans to move more of this stuff to userdata and/or versioned files in the store. We should ensure that changing either the set of trusted CAs or the active CA causes the control plane nodes to be treated as "NeedsUpdate". We should also allow for a future need to do an apply_cluster
.
|
||
klog.Info("rotating the control plane") | ||
|
||
//Update masters. This will issue new certs for k8s using the new CA. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
//Update masters. This will issue new certs for k8s using the new CA. | |
// Update control plane. This will issue new certs for k8s using the new CA. |
return fmt.Errorf("failed to rotate cluster: %v", err) | ||
} | ||
|
||
klog.Info("rotating the control plane") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The new certificate-authority-data
needs to be propagated to all the kubeconfig files that live outside the cluster before proceeding to change the signing CA. This is part of the reason I don't think this should be an "all in one go" type of command.
return fmt.Errorf("only clusters using kops-controller for boostrapping nodes are supported") | ||
} | ||
|
||
exportAdmin := clientConfig.Contexts[contextName].AuthInfo == contextName |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This appears to be a bit too heuristic. It should check that the corresponding AuthInfo
has nonempty ClientCertificateData
} | ||
|
||
if len(pool.Secondary) > 0 { | ||
klog.Info("Secondary CA cert already in the pool. Not issuing a new CA") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What if the previous run died while removing trust from the old CA?
Another reason I don't like an all-in-one command: We have a workload (rhymes with CrewBeeper) which frequently fails on a node rolling update. For this reason, we only do node rolling updates with a locally written controller. That controller implements local constraints, such as notifying the relevant workload's maintainers and delaying the update of their instance group until they can be on deck to fix the resulting failure of their workload. So we would not want the command that updates the keys to perform the rolling update. |
/close Closing in favour of the great work done by John! |
@olemarkus: Closed this PR. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
Fixes #8774
Fixes #10498
Fixes #1020
Pretty much following https://kubernetes.io/docs/tasks/tls/manual-rotation-of-ca-certificates/ but in an order that makes sense for kOps