Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update rotating secrets docs #8948

Merged
merged 1 commit into from
Apr 22, 2020

Conversation

olemarkus
Copy link
Member

No description provided.

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Apr 21, 2020
You need to reboot every node (using a rolling-update). You have to use `--cloudonly` because the keypair no longer matches.

```
kops rolling-update cluster --cloudonly --force --yes
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't tested these instructions, but does the entire cluster need to be rolled or only the masters? Since the apiserver will essentially be down from the time you start deleting keys until the majority of them are rebooted, perhaps keeping the --master-interval=10s there would help minimize downtime.

If nodes do need to be replaced as well, perhaps we break it out into two rolling-update commands with only the masters using --cloudonly and the nodes being more graceful.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I lost access to the API when I deleted all the secrets. I think all the nodes need to download the new certs and re-register to the cluster. I don't know if a reboot would do that, which could potentially be less disruptive.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah so even after replacing just the masters, both you and the nodes still couldn't connect to the API. in that case it probably makes sense to rolling-update every master and node in the cluster all in one go.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps keep the --master-interval=10s ? no point in rolling those slowly if they're inaccessible. I do see an advantage of keeping the default node-interval, since it allows pods to be replaced gradually rather than having all of the nodes down at once. we wouldn't respect pod disruption budgets but at least 1 nodes' pods being lost at a time is better than all. Maybe we keep --master-interval=10s and mention that users may want to set a shorter --node-interval depending on their workloads and tolerance of downtime. Thoughts?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would at least not keep the master interval. The masters are not really doing anything useful: dns, api server etc will all be broken. There is nothing on the masters that work without the API and the API won't work until etcd is working again. I think you want to roll as fast as possible so new nodes can register (which requires quorum).

Masters typically take 5 min to roll. Rolling nodes faster than that won't really work since the nodes won't register with the masters until they are all up.

Nodes typically take 2 min to roll. We could say that it may be wise to wait 2-3 min to roll the nodes to minimize workload downtime.

Until we find a way to rotate this more automatic, I think one have to live with quite a bit of downtime here anyway though.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry just to clarify, are you proposing the docs should include --master-interval=10s or to omit it and let it use the kops default? I think it would make sense to roll the masters as fast as possible since none of them are functional until the majority of them have been replaced.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think from user perspective it is easiest to either roll everything at once. Depending on workload one may want to do things differently, but you'd save 2-3 min out of maybe 10 min downtime. So I would leave it as-is in the docs.
It would be a fun exercise to look at how we can automate and do this more gracefully though.
By not rolling ca and keeping that key more safe than the other cert (maybe not even having it in S3) would save a lot of pain since the masters should accept the new certs immediately. Don't think you would have to roll all tokens this way either.

pkill -f kube-controller-manager
kubectl delete pods --all --all-namespaces
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are the old pods still functional? I suppose any that depend on their service tokens no longer work, but other pods would still be functional. I'm mostly thinking about how we can be as least disruptive as necessary.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All pods that are not using the SA token is functional.

But we just rolled --cloudonly the entire cluster, so I figured this would be easier and faster than the user having to look for pods that may be using the API or using the token for other purposes.

@rifelpet
Copy link
Member

ok, this looks good and if we can figure out a less invasive way to perform the rotating we can always update this later. thanks!
/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Apr 22, 2020
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: olemarkus, rifelpet

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 22, 2020
@k8s-ci-robot k8s-ci-robot merged commit b62c01f into kubernetes:master Apr 22, 2020
@k8s-ci-robot k8s-ci-robot added this to the v1.18 milestone Apr 22, 2020
@olemarkus olemarkus deleted the docs-secret-rotate branch April 25, 2020 19:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/documentation cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants