
Discovery is throttled when there are lots of resources (CRDs) #1126

Closed
seans3 opened this issue Oct 5, 2021 · 11 comments · Fixed by kubernetes/kubernetes#105520
Labels: kind/bug (Categorizes issue or PR as related to a bug.), triage/accepted (Indicates an issue or PR is ready to be actively worked on.)

seans3 (Contributor) commented Oct 5, 2021

From k/k: kubernetes/kubernetes#105489

The default client side rate limit is 50qps with a burst of 100.

Kubectl does a discovery pass before performing the requested operation. Discovery involves (some combination of) reading the OpenAPI spec and the "homegrown" discovery information, e.g. for mapping kinds to resources. These documents can sometimes be cached, but even when cached, an API call can be required to verify they haven't changed.

This means that if there are more than 100 group versions, we will hit the client side rate limit in the process of discovery, possibly significantly slowing down the process.

AFAIK there is no way to adjust this limit when using kubectl.

A user report can be found here: kubernetes/kubernetes#101634 (comment)

An easy way to fix this is to disable the client side limit in kubectl. Possibly caching parameters could be adjusted, too. A difficult fix would be to load the minimum amount of discovery documents necessary to perform the task at hand (but some tasks require everything).

@seans3 seans3 added the kind/bug Categorizes issue or PR as related to a bug. label Oct 5, 2021
@k8s-ci-robot k8s-ci-robot added the needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. label Oct 5, 2021
seans3 (Contributor, Author) commented Oct 5, 2021

/assign @lavalamp
/assign @justinsb
/assign

seans3 (Contributor, Author) commented Oct 5, 2021

/triage accepted

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Oct 5, 2021
seans3 (Contributor, Author) commented Oct 5, 2021

@justinsb commented on original bug:

Two contributory things:

I think there's a bug where `kubectl get clusterrolebindings.rbac.authorization.k8s.io` forces a cache invalidation and full rediscovery, but `kubectl get clusterrolebindings` does not. The gotcha is that tab completion uses the full name, and of course you're more likely to use tab completion when you have lots of CRDs, making discovery more expensive.

In general, we shouldn't have to do a full discovery to map GVK -> GVR. In the vast majority of cases I think we should be able to just do type discovery in the group and then find the K -> R. (Daniel Smith shared something about Openshift clusters having the G in GVK not equal to the G in GVR, but I would hope we could address that with a fallback). This would be great for controllers, but I don't think it helps for kubectl where we often need shortnames from the apiserver (and in general, kubectl is going to use more of the surface than a controller which only needs GVK -> GVR mapping, so a full discovery is more justified ... ideally we wouldn't invalidate it every time the user typed though!)

justinsb (Member) commented Oct 5, 2021

Copied from the original bug:

Two contributory things:

  • I think there's a bug where kubectl get clusterrolebindings.rbac.authorization.k8s.io forces a cache invalidation and full rediscovery, but kubectl get clusterrolebindings does not. The gotcha is that tab completion uses the full name, and of course you're more likely to use tab completion when you have lots of CRDs, making discovery more expensive.

  • In general, we shouldn't have to do a full discovery to map GVK -> GVR. In the vast majority of cases I think we should be able to just do type discovery in the group and then find the K -> R. (@lavalamp shared something about Openshift clusters having the G in GVK not equal to the G in GVR, but I would hope we could address that with a fallback). This would be great for controllers, but I don't think it helps for kubectl where we often need shortnames from the apiserver (and in general, kubectl is going to use more of the surface than a controller which only needs GVK -> GVR mapping, so a full discovery is more justified ... ideally we wouldn't invalidate it every time the user typed <tab> though!)

negz commented Oct 28, 2021

I just came here to raise this same issue. Over in @crossplane land we'd like to be able to install a lot of CRDs. You can think of Crossplane providers as controller managers that reconcile custom resources representing some set of "external" (to the API server where Crossplane is running) resources - e.g. SQL databases or GCP resources. The long tail of our providers are pretty small (<10 custom resource definitions), but the big ones currently correspond to entire cloud providers (similar to Google KCC) and are thus on the order of 700 custom resource definitions. Hypothetically, if someone wanted to install the AWS, GCP, and Azure providers all at the same time, that would be around 2,000 CRDs.

We've considered breaking up our providers, since it's unlikely folks are actually using even half of (e.g.) all 700 AWS APIs, but the only real impetus for doing so is avoiding performance bottlenecks. Alongside expensive OpenAPI processing (kubernetes/kubernetes#105932), lengthy discovery is the next most annoying consequence of having many CRDs; we're seeing discovery sometimes take minutes.

soltysh (Contributor) commented Oct 29, 2021

As discussed in kubernetes/kubernetes#105520, we're disabling the discovery burst in kubectl for now.

negz commented Oct 29, 2021

Thanks @soltysh! Can you help me understand how this is likely to be released - is it reasonable to expect it to be backported to previous versions of kubectl? It's not intuitive to me because on one hand it's a kind of a bug fix, while on the other hand it's quite a big behaviour change (and kind of depends on APF).

soltysh (Contributor) commented Nov 2, 2021

@negz this will be part of the 1.23 release, and as you're saying, I wouldn't personally consider this a backportable bug.

kbudde commented Jan 14, 2022

@negz, @soltysh: as far as I can see the limit should have been raised to 300 (#105520), but the fix doesn't work for 1.23 (see #107131). Now it's part of release 1.24.
What is the process to get this change backported into older versions?

chlunde commented Jan 24, 2022

@kbudde https://github.com/kubernetes/sig-release/blob/master/release-engineering/role-handbooks/patch-release-team.md#cherry-pick-requests documents the process.

@justinsb is there another open issue for this, or is it just inside a comment in this closed issue? "I think there's a bug where kubectl get clusterrolebindings.rbac.authorization.k8s.io forces a cache invalidation and full rediscovery..." Personally, I would hope that a full name could just fetch information for that single resource type and always be fast, even when the cache is stale, unless it's possible for a shortname or category to also call itself clusterrolebindings.rbac.authorization.k8s.io.

kong62 commented May 8, 2023

same issue:

# kubectl drain --ignore-daemonsets --delete-emptydir-data --delete-local-data --grace-period=80 10.44.5.158
I0306 20:48:56.890979   12500 request.go:665] Waited for 1.126157702s due to client-side throttling, not priority and fairness, request: GET:https://10.44.6.13:60002/api/v1/namespaces/prod/pods/caesar-remotequery-msv-20220803145002-b08f661-784bbbd85b-6nnvw
I0306 20:49:06.891933   12500 request.go:665] Waited for 1.135381322s due to client-side throttling, not priority and fairness, request: GET:https://10.44.6.13:60002/api/v1/namespaces/prod/pods/basketball-pc-web-20230303181501-7cc54b4-6d65996dd6-225n9
