Dynamically reload kube-aggregator certificates #88120
Conversation
}

// DynamicRestConfigProvider provides a restconfig and transport backed by dynamically loaded cert/key pair
type DynamicRestConfigProvider struct {
this probably deserves some tests :) I'll add those after initial review validating the direction of this PR
/assign @p0lyn0mial
😂 I'll check out the failures, I have run only unit tests locally :)
What happens to the old cert? It is still valid, right?
var _ dynamiccertificates.Listener = &DynamicRestConfigProvider{}
func NewDynamicRestConfigProvider(servingContent dynamiccertificates.CertKeyContentProvider, egressSelector *egressselector.EgressSelector, proxyTransport *http.Transport, insecure bool, serverName string, caBundle []byte) (*DynamicRestConfigProvider, error) {
it has to be dynamic: serverName and caBundle are not constant, what if they have changed?
based on my understanding, these are provided when kube-aggregator starts so they cannot change without the process being restarted. But I might have missed something...
check when the AddAPIService function is called - https://github.com/kubernetes/kubernetes/pull/88120/files#r379316414
}

func (c *DynamicRestConfigProvider) syncRestConfig() error {
	cert, key := c.certKeyPair.CurrentCertKeyContent()
it will create a new config every second even though nothing has changed, are there any drawbacks?
yeah I should probably add checking whether the value changed, good point. I'll change that
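That check could be sketched like this (a minimal illustration with hypothetical names, not the PR's actual code): compare the current cert/key bytes against the last-seen pair and skip the rebuild when nothing changed.

```go
package main

import (
	"bytes"
	"fmt"
)

// certKeyCache remembers the last cert/key bytes that were used to
// build a rest config, so a periodic sync can return early when the
// content is unchanged. Names are illustrative.
type certKeyCache struct {
	lastCert, lastKey []byte
}

// hasChanged reports whether cert/key differ from the last-seen pair
// and, if so, records the new content.
func (c *certKeyCache) hasChanged(cert, key []byte) bool {
	if bytes.Equal(cert, c.lastCert) && bytes.Equal(key, c.lastKey) {
		return false
	}
	c.lastCert = append([]byte(nil), cert...)
	c.lastKey = append([]byte(nil), key...)
	return true
}

func main() {
	c := &certKeyCache{}
	fmt.Println(c.hasChanged([]byte("certA"), []byte("keyA"))) // first sync rebuilds
	fmt.Println(c.hasChanged([]byte("certA"), []byte("keyA"))) // unchanged, skip
	fmt.Println(c.hasChanged([]byte("certB"), []byte("keyA"))) // rotated cert rebuilds
}
```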
@@ -154,13 +138,26 @@ func (r *proxyHandler) ServeHTTP(w http.ResponseWriter, req *http.Request) {
	newReq, cancelFn := newRequestForProxy(location, req)
	defer cancelFn()

	if handlingInfo.proxyRoundTripper == nil {
		roundTripper, err := handlingInfo.dynamicRestConfig.GetTransport()
this is a hot path; it now depends on an external component that can, for example, return an error.
any suggestions on how to resolve this?
}

// DynamicRestConfigProvider provides a restconfig and transport backed by dynamically loaded cert/key pair
type DynamicRestConfigProvider struct {
I'm wondering if we really need a controller; what would be an alternative design?
yeah I am open to other ideas, I pretty much kept this in line with how similar problems are solved in the apiserver (so the dynamiccertificates package)
How about:

1. Adding a function to APIAggregator that would return a cert and a key.
2. Adding a new method to APIAggregator that would call updateAPIService for all proxyHandlers and would use the func from 1.
3. Wiring the Notify method to call the new method from 2.

Alternatively:

1. Adding a new method to APIAggregator that would call updateAPIService for all proxyHandlers and would accept a cert and a key.
2. Wiring the Notify method to call the new method from 1, providing a cert and a key.

Doing it that way, we don't modify the hot path and we take potential changes in services into account. It should work if the old cert is still valid. What do you think?
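The second variant could be wired up roughly as follows (all names hypothetical, a sketch of the proposal rather than the PR's code): the aggregator keeps its proxy handlers in a map and exposes an Enqueue method that the cert notifier invokes on rotation, pushing the fresh cert/key into every handler outside the request path.

```go
package main

import (
	"fmt"
	"sync"
)

// aggregatorNotifier fans a cert/key rotation out to all registered
// proxy handlers, mimicking the proposed "Notify -> updateAPIService
// for all proxyHandlers" wiring. Hypothetical sketch.
type aggregatorNotifier struct {
	mu       sync.Mutex
	current  func() (cert, key []byte)
	handlers map[string]func(cert, key []byte)
}

// Enqueue satisfies a dynamiccertificates.Listener-style interface:
// it is invoked when the underlying cert/key content changes.
func (a *aggregatorNotifier) Enqueue() {
	a.mu.Lock()
	defer a.mu.Unlock()
	cert, key := a.current()
	for _, update := range a.handlers {
		update(cert, key)
	}
}

func main() {
	n := &aggregatorNotifier{
		current:  func() ([]byte, []byte) { return []byte("new-cert"), []byte("new-key") },
		handlers: map[string]func(cert, key []byte){},
	}
	n.handlers["v1beta1.metrics.k8s.io"] = func(cert, key []byte) {
		fmt.Printf("updateAPIService called with %s/%s\n", cert, key)
	}
	n.Enqueue()
}
```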
@sttts FYI
@p0lyn0mial that solves it for proxy_handlers, and then available_controller will have to implement the exact same thing (that's why I went the way of extracting that logic outside, so both places can reuse it).
If we go and extract it, then the only difference between your proposal and the current implementation is whether it uses the workqueue or not. I like reusing the workqueue since it's an already-proven piece of code that solves the async update for me.
No hard opinion here though, just putting it out here why I went the way I went. I am totally fine shifting the direction :-)
It seems that available_controller only needs a restConfig - a slightly different one than proxy_handlers. I think that before it didn't use MTLS, but now it does. I don't know why it was like that before, but it should work with MTLS as well, right?
alright, let's try with the controller, left some comments.
it does not use mtls as I pass insecure=true here https://github.com/kubernetes/kubernetes/pull/88120/files#diff-d92af716cdd9489e3f57d61b53adad72R216 ... should I make that more obvious somehow?
I think I got the confusion around "controller" now. The Run method is definitely something we don't need; it was even unused. I removed it, so now it really just asynchronously processes notifications.
if err != nil {
	return err
}
discoveryClient := &http.Client{
creating an http client on every sync doesn't seem right, we should reuse it if restConfig hasn't changed.
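The suggested reuse could look like this (hypothetical names, an illustrative sketch): hash the material the client is built from and only construct a new http.Client when that hash changes.

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"net/http"
)

// clientCache rebuilds the discovery client only when the rest config
// content it is derived from changes. Illustrative sketch.
type clientCache struct {
	lastKey [32]byte
	client  *http.Client
}

// clientFor returns the cached client when configBytes are unchanged,
// otherwise it builds (and caches) a new one.
func (c *clientCache) clientFor(configBytes []byte) *http.Client {
	key := sha256.Sum256(configBytes)
	if c.client != nil && key == c.lastKey {
		return c.client
	}
	c.lastKey = key
	c.client = &http.Client{} // real code would build the transport from the config
	return c.client
}

func main() {
	c := &clientCache{}
	a := c.clientFor([]byte("cert-v1"))
	b := c.clientFor([]byte("cert-v1"))
	d := c.clientFor([]byte("cert-v2"))
	fmt.Println(a == b, a == d) // true false
}
```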
// start timer that rechecks every minute, just in case. this also serves to prime the controller quickly.
go wait.Until(func() {
	c.Enqueue()
}, 1*time.Minute, stopCh)
do we have to call it every minute if it already supports dynamiccertificates.Notifier?
no :) I actually think we might not need the whole Run method 🤔 removing it...
what if the underlying dynamic file read doesn't support "notification"? :) I think we need to cover both paths.
@@ -301,14 +304,18 @@ func (s *APIAggregator) AddAPIService(apiService *v1.APIService) error {
	proxyPath = "/api"
}

restConfigProvider, err := util.NewDynamicRestConfigProvider(s.certKeyContentProvider, s.egressSelector, s.proxyTransport, apiService.Spec.InsecureSkipTLSVerify, apiService.Spec.Service.Name+"."+apiService.Spec.Service.Namespace+".svc", apiService.Spec.CABundle)
we need to stop it when RemoveAPIService is called.
oh good point... but right now, since I removed the Run method, it's actually just implementing Notifier, so there's nothing to stop anymore, correct?
The only thing that's "running" right now is the dynamic file reader, and that is reused across all restconfig providers, so it should stop when the whole process goes away.
hmm okay, but we still use up memory that we should clean up, wdyt?
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: alenkacz. The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files.
/assign @deads2k
/retest
/retest
@@ -177,11 +176,21 @@ func (c completedConfig) NewWithDelegate(delegationTarget genericapiserver.Deleg
	5*time.Minute, // this is effectively used as a refresh interval right now. Might want to do something nicer later on.
)

var certKeyContentProvider *dynamiccertificates.DynamicCertKeyPairContent
if len(c.ExtraConfig.ProxyClientCertFile) > 0 && len(c.ExtraConfig.ProxyClientKeyFile) > 0 {
if there isn't a client cert/key pair, then the front proxy fails and we should return an error, right?
@@ -234,6 +243,12 @@ func (c completedConfig) NewWithDelegate(delegationTarget genericapiserver.Deleg
	c.GenericConfig.SharedInformerFactory.Start(context.StopCh)
	return nil
})
if certKeyContentProvider != nil {
I wouldn't expect certKeyContentProvider to be allowed to be nil.
@@ -154,13 +144,19 @@ func (r *proxyHandler) ServeHTTP(w http.ResponseWriter, req *http.Request) {
	newReq, cancelFn := newRequestForProxy(location, req)
	defer cancelFn()

	if handlingInfo.proxyRoundTripper == nil {
		roundTripper, err := transport.New(handlingInfo.transportConfig)
previously, we re-used the same transport for all requests instead of building a fresh one. Does this carry performance penalties in terms of object creation and GC?
Yes, building this per request is expensive, especially when we set options (see comment on GetCert usage); that means every time updateAPIService is called, we are guaranteed to no longer use the previously cached TLS config.
TLS: transport.TLSConfig{
	Insecure: apiService.Spec.InsecureSkipTLSVerify,
	GetCert: func() (*tls.Certificate, error) {
		cert, key := r.certKeyContentProvider.CurrentCertKeyContent()
my understanding is that we would do this for every negotiation, but what we want is only to reload this if there was a change, right?
I'd be supportive of a change that adds a GetCert-compatible function to the certKeyContentProvider to allow directly wiring the option here. This way we can write it to be aware of when it has changed.
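Such a provider could be sketched as follows (hypothetical names, assuming a CurrentCertKeyContent-style source): GetCert re-parses the keypair only when the underlying bytes change, so one stable function value can be handed to transport.TLSConfig.GetCert.

```go
package main

import (
	"bytes"
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/tls"
	"crypto/x509"
	"crypto/x509/pkix"
	"encoding/pem"
	"fmt"
	"math/big"
	"sync"
	"time"
)

// getCertProvider caches the parsed certificate and only re-parses
// when the underlying cert/key bytes change. Hypothetical sketch.
type getCertProvider struct {
	mu                sync.Mutex
	current           func() (cert, key []byte)
	lastCert, lastKey []byte
	parsed            *tls.Certificate
}

func (p *getCertProvider) GetCert() (*tls.Certificate, error) {
	p.mu.Lock()
	defer p.mu.Unlock()
	cert, key := p.current()
	if p.parsed != nil && bytes.Equal(cert, p.lastCert) && bytes.Equal(key, p.lastKey) {
		return p.parsed, nil // unchanged content: reuse the parsed certificate
	}
	pair, err := tls.X509KeyPair(cert, key)
	if err != nil {
		return nil, err
	}
	p.lastCert, p.lastKey, p.parsed = cert, key, &pair
	return p.parsed, nil
}

// selfSignedPEM generates a throwaway cert/key pair for the demo.
func selfSignedPEM() (certPEM, keyPEM []byte) {
	key, _ := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	tmpl := &x509.Certificate{
		SerialNumber: big.NewInt(1),
		Subject:      pkix.Name{CommonName: "demo"},
		NotBefore:    time.Now(),
		NotAfter:     time.Now().Add(time.Hour),
	}
	der, _ := x509.CreateCertificate(rand.Reader, tmpl, tmpl, &key.PublicKey, key)
	keyDER, _ := x509.MarshalECPrivateKey(key)
	certPEM = pem.EncodeToMemory(&pem.Block{Type: "CERTIFICATE", Bytes: der})
	keyPEM = pem.EncodeToMemory(&pem.Block{Type: "EC PRIVATE KEY", Bytes: keyDER})
	return
}

func main() {
	certPEM, keyPEM := selfSignedPEM()
	p := &getCertProvider{current: func() ([]byte, []byte) { return certPEM, keyPEM }}
	c1, _ := p.GetCert()
	c2, _ := p.GetCert()
	fmt.Println(c1 == c2) // same parsed cert while content is unchanged
}
```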
}
transportConfig.Dial = egressDialer
Why would we do this if the previous call returned an error?
if err != nil {
	return nil, err
}
transportConfig.Dial = egressDialer
I don't understand why we use this value if the previous one was nil
@@ -95,28 +94,16 @@ func createAggregatorConfig(
	return nil, err
}

var certBytes, keyBytes []byte
should we make sure the files are readable at startup? a missing file error is easier to diagnose than weird anonymous request errors because the certs couldn't load in the background
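That startup check could look roughly like this (the helper name is hypothetical): read both files and verify they parse as a keypair before the server starts serving.

```go
package main

import (
	"crypto/tls"
	"fmt"
	"os"
)

// validateCertKeyFiles fails fast at startup when the proxy client
// cert/key files are missing or unparsable, instead of surfacing
// confusing anonymous-request errors later. Hypothetical helper.
func validateCertKeyFiles(certFile, keyFile string) error {
	certBytes, err := os.ReadFile(certFile)
	if err != nil {
		return fmt.Errorf("reading proxy client cert: %w", err)
	}
	keyBytes, err := os.ReadFile(keyFile)
	if err != nil {
		return fmt.Errorf("reading proxy client key: %w", err)
	}
	if _, err := tls.X509KeyPair(certBytes, keyBytes); err != nil {
		return fmt.Errorf("invalid proxy client cert/key pair: %w", err)
	}
	return nil
}

func main() {
	// With nonexistent paths, the check reports an error immediately.
	err := validateCertKeyFiles("/no/such/cert.pem", "/no/such/key.pem")
	fmt.Println(err != nil) // true: missing files are caught at startup
}
```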
transportConfig := &transport.Config{
	TLS: transport.TLSConfig{
		Insecure: apiService.Spec.InsecureSkipTLSVerify,
		GetCert: func() (*tls.Certificate, error) {
every time we create a GetCert function, we create a unique config that busts the cache of reusable client configs (the cache key for this is computed based on the function pointer address, see https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/client-go/transport/cache.go#L134)
in contrast, the previous version set static options that reused the cached tls config (and the existing kept-alive backend connections)
The TLS config cache has no size limit, which means this is effectively introducing a slow memory leak if APIService conditions flutter.
Hmm, does the pointer address change though? 🤔 This is implemented with the idea/impression that it does not change; am I wrong?
To test this, I wrote the following test:

- created a setup similar to TestProxyCertReload
- checked the address in the cache key after calling updateAPIService
- wrote new certs to disk
- called updateAPIService again
- checked the address in the cache again

The address was the same.
@liggitt we could extend certKeyContentProvider to provide a GetCert function so that the address stays the same, and change the cache to take into account the content, not the address.
I've opened 8520587 to demonstrate that. It still creates a new transport for each request. Most of the time the transport will be retrieved from the cache, but even that incurs a small performance cost of acquiring a lock.
We could move the code back to the updateAPIService method and reload the handler on certificate change. In that case, this approach seems to be more complicated than #92791.
@liggitt I've opened adcd6e1, which demonstrates that the key returned from the cache is the same as long as the address of the GetCert function doesn't change.
hmm... if a function with the same pointer address can return different data to different callers, that means some of the existing ways we build the client config cache could have problems.
@liggitt we could extend certKeyContentProvider to provide GetCert function so that the address stays the same and change the cache to take into account the content, not the address.
Wouldn't that compute the cache key based on the initial cert content, which would conflate client configurations that should be kept distinct? For example, config A and config B could point to different file paths /a.ca and /b.ca which happen to contain the same content at startup... if those are being reloaded dynamically, we wouldn't want them to share a client, since at a later point in time those files' content could diverge.
hmm... if a function with the same pointer address can return different data to different callers, that means some of the existing ways we build the client config cache could have problems.
comparing the address of a function doesn't seem right since golang forbids comparing the addresses of functions in general.
would you accept a patch that changes this behavior? Ideally, if we could serialize getCert and compare the bytes.
comparing the address of a function doesn't seem right since golang forbids comparing the addresses of functions in general.
yeah, I'm not thrilled with the approach... the places we do that currently are for non-serializable injected implementations.
would you accept a patch that changes this behavior? Ideally, if we could serialize getCert and compare the bytes.
we'd need a way to determine the identity / compare instances of the dynamic methods (GetCert, Dial, and Proxy)
@alenkacz: PR needs rebase. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
@p0lyn0mial: Closed this PR. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
What type of PR is this?
What this PR does / why we need it:
Dynamically load certificates for kube-aggregator
Which issue(s) this PR fixes:
Fixes #87766
Special notes for your reviewer:
Does this PR introduce a user-facing change?:
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: