Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

oidc client auth provider: cache OpenID Connect clients to prevent reinitialization #38167

Merged

Conversation

ericchiang
Copy link
Contributor

@ericchiang ericchiang commented Dec 6, 2016

Prior to this change, the OIDC client auth provider synced the OIDC provider metadata with every transport initialization. When used with kubectl, the provider can be reinitialize dozens of times, each triggering a round trip to the OIDC auth server. The overhead makes this feature largely unusable, adding several seconds to every kubectl invocation, and we observe many users simply abandoning it for other workarounds.

This fix prevents the excessive round trips by keeping a global cache of clients similarly to the TLS transport logic in client-go at the result of a significant speedup. The caching also fixes data races when a token is refreshed.

closes #37876

cc @kubernetes/sig-auth @liggitt @jsloyer @mlbiam @philips

Caching added to the OIDC client auth plugin to fix races and reduce the time kubectl commands using this plugin take by several seconds.

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Dec 6, 2016
@k8s-reviewable
Copy link

This change is Reviewable

@ericchiang ericchiang changed the title oidc client auth provider: cache OpenID Connect clients to prevent re… oidc client auth provider: cache OpenID Connect clients to prevent reinitialization Dec 6, 2016
@k8s-github-robot k8s-github-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. release-note-label-needed labels Dec 6, 2016
@philips
Copy link
Contributor

philips commented Dec 6, 2016

Manually tested and I can happily confirm it reduces the number of GET requests to the openid-configuration URL from 60+ to 1:

$ kubectl get pods  -v=9 2>&1 | grep well-known
I1205 20:14:38.118650    7143 round_trippers.go:393] curl -k -v -XGET  https://tec.a.ifup.org:32000/identity/.well-known/openid-configuration
I1205 20:14:38.311009    7143 round_trippers.go:412] GET https://tec.a.ifup.org:32000/identity/.well-known/openid-configuration 200 OK in 192 milliseconds

@jsloyer
Copy link

jsloyer commented Dec 6, 2016

@philips. I have seen some requests taking 15 seconds with an OIDC provider, what did you observe an API call through kubectl taking with this change?

@philips
Copy link
Contributor

philips commented Dec 6, 2016

@jsloyer kubectl get pods returns in 0.5s. Cluster: AWS us-west-1 accessed from residential Comcast running Tectonic and Dex v2 on cluster backed by TPRs.

$ time kubectl get pods
NAME                            READY     STATUS    RESTARTS   AGE
etcd-operator-217127005-1pi7l   1/1       Running   0          2d

real	0m0.503s
user	0m0.091s
sys	0m0.020s

@jsloyer
Copy link

jsloyer commented Dec 6, 2016

This is amazing, this is a huge speed up. I built and tested a kubectl client with this change

var (
backoff = wait.Backoff{
Duration: 1 * time.Second,
Factor: 2,
Jitter: .1,
Steps: 5,
}
cache = clientCache{cache: make(map[cacheKey]*oidc.Client)}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cache = &clientCache{cache: make(map[cacheKey]*oidc.Client)}

certAuthData, err = base64.StdEncoding.DecodeString(cfg[cfgCertificateAuthorityData])
if err != nil {
return nil, err
client, ok := cache.getClient(issuer, clientID, clientSecret)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

return early and unnest?

if ok {
  return client, nil
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's just enough logic after the if statement to make that annoying. Let me see if I can split it out to a helper or something.

if err != nil {
return nil, fmt.Errorf("error creating OIDC Client: %v", err)
}
cache.setClient(issuer, clientID, clientSecret, client)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

still have the potential to construct this multiple times if newOIDCAuthProvider is called from multiple threads... are we ok with that?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I actually don't think we are. restclient.AuthProviderConfigPersister is kinda a harry abstraction for multiple calls to the underlying transport writing to kubeconfig. I think multiple clients running around might make it worse. We might even consider syncing around using that interface, e.g. put a mutex on the client itself. gah

@liggitt
Copy link
Member

liggitt commented Dec 6, 2016

a few nits, LGTM otherwise. tests would be good.

@liggitt liggitt added release-note-none Denotes a PR that doesn't merit a release note. and removed release-note-label-needed labels Dec 6, 2016
@ericchiang
Copy link
Contributor Author

ericchiang commented Dec 14, 2016

Quick update on this PR. I've been diving a bit more into the implementation and it seems like there are some bigger issues here. For example the client doesn't sync around requesting refresh tokens and updating the kubeconfig. If it's used concurrently there's a very basic race there. Also the returned transport seems to be doing retries[0], which makes some bold assumptions about the server not consuming POST body data, and means we're not closing response bodies.

Trying to clean this stuff up along with this change.

cc @liggitt

[0]

wait.ExponentialBackoff(backoff, func() (bool, error) {

@ericchiang ericchiang force-pushed the oidc-client-auth-cache-provider branch from 32bada3 to 380c5e8 Compare December 15, 2016 01:32
@ericchiang
Copy link
Contributor Author

ericchiang commented Dec 15, 2016

@liggitt updated though this became a bit bigger of a PR.

Highlights:

  • Cache OpenID Connect clients to prevent reinitialization
  • Don't retry requests in the http.RoundTripper.
    • Don't rely on the server not reading POST bodies.
    • Don't leak response body FDs.
    • Formerly ignored any throttling requests by the server.
  • Determine if the id token's expired by inspecting it.
    • Similar to logic in golang.org/x/oauth2
  • Synchronize around refreshing tokens and persisting the new config.

Since we already checked the id token's expiry these changes actually don't enforce anything we haven't already. It's completely backward compatible but uses more sane methods for determining token validity and persisting the config.

For what it's worth I manually e2e tested this using dex.

@k8s-github-robot k8s-github-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Dec 15, 2016
@ericchiang
Copy link
Contributor Author

@k8s-bot bazel test this

@k8s-ci-robot
Copy link
Contributor

Jenkins GCI GKE smoke e2e failed for commit 380c5e8ed062707fa97896e4bdada124a126b24b. Full PR test history.

The magic incantation to run this job again is @k8s-bot gci gke e2e test this. Please help us cut down flakes by linking to an open flake issue when you hit one in your PR.

@ericchiang ericchiang force-pushed the oidc-client-auth-cache-provider branch 2 times, most recently from b69e9d2 to 17be41d Compare December 15, 2016 01:57
@ericchiang
Copy link
Contributor Author

@k8s-bot verify test this

@ericchiang ericchiang force-pushed the oidc-client-auth-cache-provider branch from 17be41d to c3edd4a Compare December 19, 2016 23:09
@ericchiang
Copy link
Contributor Author

cc @yifan-gu

@k8s-ci-robot
Copy link
Contributor

Jenkins verification failed for commit c3edd4a. Full PR test history.

The magic incantation to run this job again is @k8s-bot verify test this. Please help us cut down flakes by linking to an open flake issue when you hit one in your PR.

Copy link
Contributor

@yifan-gu yifan-gu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ericchiang Just some nits, LGTM

}
r2.Header.Set("Authorization", fmt.Sprintf("Bearer %s", token))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ericchiang why we need to deepcopy header here?
https://play.golang.org/p/rxM2ogR1GS

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To comply with the RoundTripper interface

RoundTrip should not modify the request...

https://golang.org/pkg/net/http/#RoundTripper

If we don't copy deepcopy the header then we're adding a bearer toke to the original header, modifying the original request.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ericchiang Gotcha.

defer c.mu.Unlock()
key := cacheKey{issuer, clientID, clientSecret}

if oldClient, ok := c.cache[key]; ok {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why we want to return the old client?
if we want to use the old client when it's there we can always do getClient() first.

In fact, the newOIDCAuthProvider() function below, the setClient() is always returning the latest client.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If both routine A and B try to initialize at the same time, and A success and puts it in the cache, we want B to use A's client rather than the one it managed to create. We have the cache to always return the same client for each provider so it can guard the kubeconfig with the same mutex.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let me add a comment.

Copy link
Contributor

@yifan-gu yifan-gu Dec 22, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ericchiang I see, it's a testAndSet operation, maybe rename the func, or leave it as is with some comments, ok with either. Thanks 👍

@ericchiang
Copy link
Contributor Author

cc @kubernetes/sig-auth-misc anyone else interested in reviewing this? Without this kubectl is relatively unusable with this auth plugin if you're not sitting next to your cluster :)

// Stubbed out for testing.
now func() time.Time

// Mutex guards presisting to the kubeconfig file and allows synchronized
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

persisting

if err != nil {
return jose.JWT{}, fmt.Errorf("could not perist new tokens: %v", err)
idToken := jwt.Encode()
newCfg[cfgIDToken] = idToken
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should this just be newCfg[cfgIDToken] = tokens.IDToken?

@liggitt
Copy link
Member

liggitt commented Dec 22, 2016

just a couple small things, LGTM

* Cache OpenID Connect clients to prevent reinitialization
* Don't retry requests in the http.RoundTripper.
  * Don't rely on the server not reading POST bodies.
  * Don't leak response body FDs.
  * Formerly ignored any throttling requests by the server.
* Determine if the id token's expired by inspecting it.
  * Similar to logic in golang.org/x/oauth2
* Synchronize around refreshing tokens and persisting the new config.
@ericchiang ericchiang force-pushed the oidc-client-auth-cache-provider branch from 7e44f2d to 13e6318 Compare December 22, 2016 22:19
@k8s-ci-robot
Copy link
Contributor

Jenkins GCE etcd3 e2e failed for commit 13e6318. Full PR test history.

The magic incantation to run this job again is @k8s-bot gce etcd3 e2e test this. Please help us cut down flakes by linking to an open flake issue when you hit one in your PR.

@ericchiang
Copy link
Contributor Author

@k8s-bot bazel test this

1 similar comment
@ericchiang
Copy link
Contributor Author

@k8s-bot bazel test this

@k8s-ci-robot
Copy link
Contributor

Jenkins Bazel Build failed for commit 13e6318. Full PR test history.

The magic incantation to run this job again is @k8s-bot bazel test this. Please help us cut down flakes by linking to an open flake issue when you hit one in your PR.

@philips
Copy link
Contributor

philips commented Dec 30, 2016

@k8s-bot bazel test this

Filed an issue about the empty result kubernetes/test-infra#1465

@philips
Copy link
Contributor

philips commented Dec 30, 2016

@k8s-bot etcd3 test this

@philips
Copy link
Contributor

philips commented Dec 30, 2016

OK, tests finally went green. LGTM at this point but I am not an owner.

@liggitt liggitt added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 4, 2017
@liggitt
Copy link
Member

liggitt commented Jan 4, 2017

LGTM

@ericchiang
Copy link
Contributor Author

Anyone know how to retrigger the kops build?

@ericchiang
Copy link
Contributor Author

@k8s-bot kops test this

@k8s-github-robot
Copy link

Automatic merge from submit-queue (batch tested with PRs 39648, 38167, 39591, 39415, 39612)

@k8s-github-robot k8s-github-robot merged commit 17665a0 into kubernetes:master Jan 10, 2017
@erictune erictune added this to the v1.5 milestone Jan 20, 2017
@yifan-gu yifan-gu added release-note Denotes a PR that will be considered when it comes time to generate release notes. and removed release-note-none Denotes a PR that doesn't merit a release note. labels Jan 20, 2017
@saad-ali saad-ali added the cherry-pick-approved Indicates a cherry-pick PR into a release branch has been approved by the release branch manager. label Jan 26, 2017
k8s-github-robot pushed a commit that referenced this pull request Jan 31, 2017
…8167-upstream-release-1.5

Automatic merge from submit-queue

Automated cherry pick of #38167

Cherry pick of #38167 on release-1.5.

#38167: rework oidc client auth provider
@k8s-cherrypick-bot
Copy link

Commit found in the "release-1.5" branch appears to be this PR. Removing the "cherrypick-candidate" label. If this is an error find help to get your PR picked.

@ericchiang ericchiang deleted the oidc-client-auth-cache-provider branch April 28, 2017 16:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cherry-pick-approved Indicates a cherry-pick PR into a release branch has been approved by the release branch manager. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

kubectl auth providers are repeatedly reinitialized