Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add ability to reload client certificates from disk #79083

Merged

Conversation

@jackkleeman
Copy link
Member

jackkleeman commented Jun 16, 2019

What type of PR is this?
/kind feature

What this PR does / why we need it:
This PR is intended to allow client certificate rotation from disk. In future it can apply to all certificate callbacks, but for now it handles cert/key file rotation only. The complexity mainly arises from trying to handle the case where a certificate expires for an open connection, which is a real issue given that the apiserver verifies on every request, and connections tend to be highly reused.

  1. Change the default behaviour when certificate files are specified, to get certificates from a function which loads the cert off disk (at most every minute). We may want to introduce a feature gate or to otherwise control this behaviour, although I think it's a fairly safe default - we don't guarantee that we'll only read the certificate on first boot.
  2. Add generic behaviour to the transport creation functions when using a certificate files, which notes any change in the returned certificate, closing all connections when doing so. Additionally, we check for new certificates every 5 minutes. Together these two behaviours should cover the possibility of certs expiring while connections are open, causing unauthorised errors.

Which issue(s) this PR fixes:

Fixes #4672

Special notes for your reviewer:
This approach was discussed briefly in the June 12th 2019 sig auth meeting - watching that could be useful context. This is only one possible approach (eg, we could use a custom roundtripper instead) and I am very open to ideas.

Does this PR introduce a user-facing change?:

When client certificate files are provided, reload files for new connections, and close connections when a certificate changes.
@k8s-ci-robot

This comment has been minimized.

Copy link
Contributor

k8s-ci-robot commented Jun 16, 2019

Hi @jackkleeman. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@jackkleeman

This comment has been minimized.

Copy link
Member Author

jackkleeman commented Jun 16, 2019

/assign @smarterclayton

@@ -89,6 +89,9 @@ type AuthInfo struct {
// ClientKeyData contains PEM-encoded data from a client key file for TLS. Overrides ClientKey
// +optional
ClientKeyData []byte `json:"client-key-data,omitempty"`
// ReloadCertFromDisk causes the certificate specified by ClientCertificate and ClientKey to be periodically reloaded from disk.

This comment has been minimized.

Copy link
@smarterclayton

smarterclayton Jun 17, 2019

Contributor

Why would we not infer this behavior if ClientKey is set but not ClientKeyData? That would remove the need for a new flag.

This comment has been minimized.

Copy link
@jackkleeman

jackkleeman Jun 17, 2019

Author Member

That's totally possible. The reason I didn't is so that users need to request this new behaviour. I was concerned we could break some existing setups if we start reloading from disk. But I guess it can go in the release notes - happy to make that change if you want

This comment has been minimized.

Copy link
@jackkleeman

jackkleeman Jun 17, 2019

Author Member

@smarterclayton I have changed such that we default to the rotation behaviour whenever files are provided in rest.Config, meaning that we don't affect kubeconfig parsing at all. We could settle on a middle ground where there's a flag in rest.Config, but frankly I think its sensible if long term the data fields are the 'static' way to provide certificates, and providing files means you are accepting that the files will be reloaded.

@smarterclayton

This comment has been minimized.

Copy link
Contributor

smarterclayton commented Jun 17, 2019

How is this change expected to alter the Kubelet's client cert rotation logic?

@jackkleeman

This comment has been minimized.

Copy link
Member Author

jackkleeman commented Jun 17, 2019

This change deliberately does not affect the kubelet rotation logic at all, as that logic builds its own custom transport.

I suggest that once this is merged we port the kubelet logic to only handle writing cert files to disk, and then use the transport and the reloading logic in this PR. But I thought best to keep that to a separate PR.

@jennybuckley

This comment has been minimized.

Copy link
Contributor

jennybuckley commented Jun 17, 2019

/cc @deads2k

@dims

This comment has been minimized.

Copy link
Member

dims commented Jul 23, 2019

/ok-to-test
/assign @liggitt
/milestone v1.16

@k8s-ci-robot k8s-ci-robot added this to the v1.16 milestone Jul 23, 2019
@jackkleeman jackkleeman force-pushed the jackkleeman:client-cert-file-reload branch from 5373b95 to d611ade Jul 23, 2019
@jackkleeman

This comment has been minimized.

Copy link
Member Author

jackkleeman commented Jul 23, 2019

/assign @liggitt

@jackkleeman

This comment has been minimized.

Copy link
Member Author

jackkleeman commented Jul 24, 2019

/retest

@xmudrii

This comment has been minimized.

Copy link
Member

xmudrii commented Aug 12, 2019

@jackkleeman Hello! I'm bug triage lead for the 1.16 release cycle and considering this PR has not been updated for some time, I'd like to check what's the status of this PR. The code freeze is starting 29th August (about 2.5 weeks from now) and while there is plenty of time, we want to ensure that each PR has a chance to be merged on time.

As the PR is tagged for 1.16, is it still planned for this release?

@jackkleeman

This comment has been minimized.

Copy link
Member Author

jackkleeman commented Aug 15, 2019

@xmudrii I think we are blocked on review and have been for a while. Unless someone picks it up it's unlikely to be approved by then

@liggitt

This comment has been minimized.

Copy link
Member

liggitt commented Aug 30, 2019

This will slip to 1.17. I'll have review bandwidth after 1.16 freeze.

@k8s-ci-robot k8s-ci-robot added the size/L label Feb 29, 2020
@jackkleeman

This comment has been minimized.

Copy link
Member Author

jackkleeman commented Feb 29, 2020

I've decided to remove the logic that handles rotation for arbitrary GetCert callbacks. This complicates this PR, increases the risk of a breaking change, and doesn't have an immediate use case. Reverting 4f49d9f will be sufficient to handle this in a later PR

@jackkleeman

This comment has been minimized.

Copy link
Member Author

jackkleeman commented Mar 1, 2020

/retest

@mikedanese

This comment has been minimized.

Copy link
Member

mikedanese commented Mar 1, 2020

@jackkleeman Can you squash the commit history?

…lows the provided client certificate files to be reloaded from disk (currently on every use)

Close outbound connections when using a cert callback and certificates rotate. This means that we won't get into a situation where we have open TLS connections using expires certs, which would get unauthorized errors at the apiserver

Attempt to retrieve a new certificate if open connections near expiry, to prevent the case where the cert expires but we haven't yet opened a new TLS connection and so GetClientCertificate hasn't been called.

Move certificate rotation logic to a separate function

Rely on generic transport approach to handle closing TLS client connections in exec plugin; no need to use a custom dialer as this is now the default behaviour of the transport when faced with a cert callback. As a result of handling this case, it is now safe to apply the transport approach even in cases where there is a custom Dialer (this will not affect kubelet connrotation behaviour, because that uses a custom transport, not just a dialer).

Check expiry of the full TLS certificate chain that will be presented, not only the leaf. Only do this check when the certificate actually rotates. Start the certificate as a zero value, not nil, so that we don't see a rotation when there is in fact no client certificate

Drain the timer when we first initialize it, to prevent immediate rotation. Additionally, calling Stop() on the timer isn't necessary.

Don't close connections on the first 'rotation'

Remove RotateCertFromDisk and RotateClientCertFromDisk flags.

Instead simply default to rotating certificates from disk whenever files are exclusively provided.

Add integration test for client certificate rotation

Simplify logic; rotate every 5 mins

Instead of trying to be clever and checking for rotation just before an
expiry, let's match the logic of the new apiserver cert rotation logic
as much as possible. We write a controller that checks for rotation
every 5 mins. We also check on every new connection.

Respond to review

Fix kubelet certificate rotation logic

The kubelet rotation logic seems to be broken because it expects its
cert files to end up as cert data whereas in fact they end up as a
callback. We should just call the tlsConfig GetCertificate callback
as this obtains a current cert even in cases where a static cert is
provided, and check that for validity.

Later on we can refactor all of the kubelet logic so that all it does is
write files to disk, and the cert rotation work does the rest.

Only read certificates once a second at most

Respond to review

1) Don't blat the cert file names
2) Make it more obvious where we have a neverstop
3) Naming
4) Verbosity

Avoid cache busting

Use filenames as cache keys when rotation is enabled, and add the
rotation later in the creation of the transport.

Caller should start the rotating dialer

Add continuous request rotation test

Rebase: use context in List/Watch

Swap goroutine around

Retry GETs on net.IsProbableEOF

Refactor certRotatingDialer

For simplicity, don't affect cert callbacks

To reduce change surface, lets not try to handle the case of a changing
GetCert callback in this PR. Reverting this commit should be sufficient
to handle that case in a later PR.

This PR will focus only on rotating certificate and key files.
Therefore, we don't need to modify the exec auth plugin.

Fix copyright year
@jackkleeman jackkleeman force-pushed the jackkleeman:client-cert-file-reload branch from 405f030 to 929b155 Mar 2, 2020
@jackkleeman jackkleeman requested a review from mikedanese Mar 2, 2020
@jackkleeman

This comment has been minimized.

Copy link
Member Author

jackkleeman commented Mar 2, 2020

/retest

}()

for range time.Tick(time.Second) {
_, err := client.CoreV1().ServiceAccounts("default").List(ctx, v1.ListOptions{})

This comment has been minimized.

Copy link
@deads2k

deads2k Mar 2, 2020

Contributor

context worked out quite nicely here.

return func() (*tls.Certificate, error) {
currentMtx.RLock()
if current.isStale() {
currentMtx.RUnlock()

This comment has been minimized.

Copy link
@mikedanese

mikedanese Mar 2, 2020

Member

Using a regular Mutex would make this a lot simpler.:

mu.Lock()
defer mu.Unlock()
if current.isStale() {
  current = refresh()
}
return current

Given this is only required on connection initiation, I doubt there'll be significant benifit from using a RWMutex here. Unlocks that aren't guarded to the scope by defers will deadlock if anything panics in the function you are calling (which has happened before in Go cert parsing).

@deads2k

This comment has been minimized.

Copy link
Contributor

deads2k commented Mar 2, 2020

the simplification helped me.

/lgtm
/hold

holding in case @liggitt or @mikedanese has anything to add.

EDIT: let's say comments by wednesday.

@deads2k

This comment has been minimized.

Copy link
Contributor

deads2k commented Mar 2, 2020

/approve

@k8s-ci-robot

This comment has been minimized.

Copy link
Contributor

k8s-ci-robot commented Mar 2, 2020

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: deads2k, jackkleeman

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@liggitt

This comment has been minimized.

Copy link
Member

liggitt commented Mar 2, 2020

I won't have a chance to review this before freeze. I'll defer to @deads2k's review

@mikedanese

This comment has been minimized.

Copy link
Member

mikedanese commented Mar 2, 2020

I scanned it this morning and saw nothing that should block. Ill dig in post submit when I get a chance.

/hold cancel

@k8s-ci-robot k8s-ci-robot merged commit 4b1ee39 into kubernetes:master Mar 3, 2020
11 of 17 checks passed
11 of 17 checks passed
pull-kubernetes-e2e-gce Job triggered.
Details
pull-kubernetes-e2e-gce-100-performance Job triggered.
Details
pull-kubernetes-e2e-kind Job triggered.
Details
pull-kubernetes-integration Job triggered.
Details
pull-kubernetes-kubemark-e2e-gce-big Job triggered.
Details
pull-kubernetes-verify Job triggered.
Details
cla/linuxfoundation jackkleeman authorized
Details
pull-kubernetes-bazel-build Job succeeded.
Details
pull-kubernetes-bazel-test Job succeeded.
Details
pull-kubernetes-conformance-kind-ipv6-parallel Job succeeded.
Details
pull-kubernetes-dependencies Job succeeded.
Details
pull-kubernetes-e2e-gce-device-plugin-gpu Job succeeded.
Details
pull-kubernetes-e2e-kind-ipv6 Job succeeded.
Details
pull-kubernetes-node-e2e Job succeeded.
Details
pull-kubernetes-node-e2e-containerd Job succeeded.
Details
pull-kubernetes-typecheck Job succeeded.
Details
tide In merge pool.
Details
@k8s-ci-robot

This comment has been minimized.

Copy link
Contributor

k8s-ci-robot commented Mar 3, 2020

@jackkleeman: The following test failed, say /retest to rerun all failed tests:

Test name Commit Details Rerun command
pull-kubernetes-e2e-gce 929b155 link /test pull-kubernetes-e2e-gce

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@jackkleeman jackkleeman deleted the jackkleeman:client-cert-file-reload branch Mar 3, 2020
defer c.queue.ShutDown()

klog.Infof("Starting client certificate rotation controller")
defer klog.Infof("Shutting down client certificate rotation controller")

This comment has been minimized.

Copy link
@neolit123

neolit123 Mar 3, 2020

Member

these lines introduced non-verbose output to kubectl auth:

kubectl auth can-i get --kubeconfig=/etc/kubernetes/kubelet.conf --namespace=kube-system configmaps/kube-proxy
I0303 23:14:14.609921    7176 cert_rotation.go:137] Starting client certificate rotation controller
yes

potentially breaking consumers of CombinedOutput looking for just "yes".

unless there are immediate objections i'm going to send a PR to .V(1) the log messages.

This comment has been minimized.

Copy link
@liggitt

liggitt Mar 3, 2020

Member

Definitely. I'd probably call it v3

const workItemKey = "key"

// CertCallbackRefreshDuration is exposed so that integration tests can crank up the reload speed.
var CertCallbackRefreshDuration = 5 * time.Minute

This comment has been minimized.

Copy link
@tnozicka

tnozicka Mar 6, 2020

Contributor

Why was the default chosen so high? The IO and CPU load would surely withstand a more responsive default like 15s / 30s. If I change a cert that is live reloaded I'd expect it to manifest way sooner then 5 minutes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.