
Bump discovery burst for kubectl to 300 #105520

Merged: 2 commits merged into kubernetes:master on Nov 17, 2021

Conversation

@soltysh (Contributor) commented on Oct 6, 2021

What type of PR is this?

/kind cleanup
/sig cli
/priority backlog

What this PR does / why we need it:

This bumps the discovery burst for the kubectl command from 100, defined in

// The more groups you have, the more discovery requests you need to make.
// given 25 groups (our groups + a few custom resources) with one-ish version each, discovery needs to make 50 requests
// double it just so we don't end up here again for a while. This config is only used for discovery.
discoveryBurst: 100,
to 150.

Which issue(s) this PR fixes:

Fixes kubernetes/kubectl#1126

Special notes for your reviewer:

/assign @seans3 @lavalamp @justinsb

Does this PR introduce a user-facing change?

NONE
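
For context, here is a minimal sketch (not the literal kubectl wiring) of how discovery client limits like the one discussed here are configured through cli-runtime's genericclioptions.ConfigFlags, assuming a cli-runtime version that has the WithDiscoveryQPS helper; the 300/50.0 values match the merged v1.23 line quoted later in this thread.

```go
// Minimal sketch: configuring kubectl-style discovery limits via
// cli-runtime's ConfigFlags, then issuing a single discovery call.
package main

import (
	"fmt"

	"k8s.io/cli-runtime/pkg/genericclioptions"
)

func main() {
	// Burst controls how many discovery requests may be fired immediately;
	// QPS is the sustained client-side rate once the burst is exhausted.
	flags := genericclioptions.NewConfigFlags(true).
		WithDeprecatedPasswordFlag().
		WithDiscoveryBurst(300).
		WithDiscoveryQPS(50.0)

	// ToDiscoveryClient builds a cached discovery client from the flags
	// (kubeconfig, context, ...) with the limits above applied.
	dc, err := flags.ToDiscoveryClient()
	if err != nil {
		panic(err)
	}

	groups, err := dc.ServerGroups()
	if err != nil {
		panic(err)
	}
	fmt.Printf("discovered %d API groups\n", len(groups.Groups))
}
```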

@k8s-ci-robot added labels on Oct 6, 2021: release-note-none, kind/cleanup, sig/cli, size/XS, priority/backlog, cncf-cla: yes, needs-triage
@soltysh (Contributor, Author) commented on Oct 6, 2021

/triage accept

@k8s-ci-robot (Contributor) replied:

@soltysh: The label(s) triage/accept cannot be applied, because the repository doesn't have them.

In response to this:

/triage accept


@soltysh (Contributor, Author) commented on Oct 6, 2021

/triage accepted

@k8s-ci-robot added the triage/accepted label and removed the needs-triage label on Oct 6, 2021
@k8s-ci-robot added the approved and area/kubectl labels on Oct 6, 2021
@@ -263,9 +263,6 @@ func (f *ConfigFlags) toDiscoveryClient() (discovery.CachedDiscoveryInterface, e
return nil, err
}

// The more groups you have, the more discovery requests you need to make.
// given 25 groups (our groups + a few custom resources) with one-ish version each, discovery needs to make 50 requests
// double it just so we don't end up here again for a while. This config is only used for discovery.
Reviewer (Member):
added in b3dad83 ... I guess 3 years qualifies as "a while"

kubeConfigFlags := genericclioptions.NewConfigFlags(true).WithDeprecatedPasswordFlag()
// The more groups you have, the more discovery requests you need to make.
// given 25 groups (our groups + a few custom resources) with one-ish version each, discovery needs to make 50 requests
// tripple it just so we don't end up here again for a while. This is updated from the
Reviewer (Member):
this burst seems like the kubernetes equivalent of the debt ceiling ... if our response every time we hit it is to raise it, I'm not sure I see the point

Reviewer (Member):
Yeah, I think we should disable this and let APF slow the client if necessary

@soltysh (Contributor, Author) replied on Oct 7, 2021:

sgtm, I'll re-work this to entirely remove this functionality from kubectl

@soltysh (Contributor, Author):
actually, the burst is part of client-go:

// If it's zero, the created RESTClient will use DefaultBurst: 10.

and the default there is even smaller than what we already set in kubectl. With that I'm seeing two options: we either drop it even from client-go (but I'll let you decide that), or we set it artificially big in kubectl, 999, for example. wdyt?
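
For readers following along, here is a minimal sketch of the client-go fallback the quoted comment refers to, assuming client-go's exported defaults rest.DefaultQPS = 5 and rest.DefaultBurst = 10: when QPS and Burst are left at zero on rest.Config, the client-side throttle is a token bucket built from those small defaults.

```go
// Minimal sketch: how zero QPS/Burst on rest.Config fall back to
// client-go's defaults before building the token-bucket throttle.
package main

import (
	"fmt"

	"k8s.io/client-go/rest"
	"k8s.io/client-go/util/flowcontrol"
)

func main() {
	cfg := &rest.Config{Host: "https://example.invalid"} // QPS and Burst left at zero

	qps := cfg.QPS
	if qps == 0 {
		qps = rest.DefaultQPS // 5 requests per second sustained
	}
	burst := cfg.Burst
	if burst == 0 {
		burst = rest.DefaultBurst // 10 requests allowed in a burst
	}

	// Every request waits on this limiter before being sent to the API server.
	limiter := flowcontrol.NewTokenBucketRateLimiter(qps, burst)
	fmt.Printf("effective QPS=%v burst=%v, first request admitted: %v\n",
		qps, burst, limiter.TryAccept())
}
```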

Reviewer (Member):
Setting it to -1 will disable it.

Reviewer (Contributor):
I agree with the core point of this comment that increasing the value is kicking the can down the road.
However, isn't disabling it completely a risk that something like kubectl breaks the API server? That'd be annoying 😄

So I'm unsure whether the real solution shouldn't instead be more like what @justinsb suggested, where we lower the number of API requests needed in the first place.

That being said, the change in this PR probably helps us kick the can a little further down the road and buy more time to implement a root-cause fix.

Reviewer (Contributor), quoting the comment above:
Yeah, I think we should disable this and let APF slow the client if necessary

Without knowing the full details of how APF would handle this, this would be my preference. Along with @justinsb's suggestion that the discovery process potentially be revisited to make fewer requests where possible. Full details in kubernetes/kubectl#1126 (comment), but we have cases where it's possible ~2,000 CRDs may end up installed and I don't think an additional 50 burst qps is going to make a meaningful difference in that situation.

@soltysh (Contributor, Author) commented on Oct 29, 2021

@lavalamp @seans3 disabled it for kubectl, ptal

@negz (Contributor) commented on Oct 30, 2021

I was curious whether this PR would fix issues I've been seeing with discovery taking forever when there are many (hundreds) of CRDs, but found it did not work:

$ KUBECONFIG=~/control/negz/crossplane-scale/cluster-aws.kcfg _output/dockerized/bin/linux/amd64/kubectl get nodes
error: rate: Wait(n=1) exceeds limiter's burst -1

I opened #106016 which takes an alternative approach.
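
The error negz hit comes from the underlying token-bucket limiter (golang.org/x/time/rate, which client-go's flowcontrol limiter is built on): Wait fails whenever the requested token count exceeds the limiter's burst, so a burst of -1 rejects every request rather than disabling throttling. A minimal sketch reproducing it:

```go
// Minimal sketch: a burst of -1 makes x/time/rate reject every Wait call
// with the exact error quoted in the comment above.
package main

import (
	"context"
	"fmt"

	"golang.org/x/time/rate"
)

func main() {
	limiter := rate.NewLimiter(rate.Limit(50), -1) // burst of -1, as in the failing kubectl build
	err := limiter.Wait(context.Background())      // Wait is WaitN(ctx, 1)
	fmt.Println(err)                               // rate: Wait(n=1) exceeds limiter's burst -1
}
```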

@negz (Contributor) commented on Nov 1, 2021

How do folks feel about this (or #106016) as a candidate for cherry-picking? We in @crossplane land have a feature that uses a lot of CRDs and is thus pretty degraded by huge (6+ minute) discovery wait times, so we'd really appreciate being able to get this fix into the hands of our users. That said, I understand that removing client-side rate limits could be a hard sell as a cherry-pick.

@soltysh (Contributor, Author) commented on Nov 2, 2021

/retest

@k8s-ci-robot added the lgtm label on Nov 16, 2021
@k8s-ci-robot (Contributor) commented:

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: eddiezane, soltysh


@k8s-triage-robot commented:

The Kubernetes project has merge-blocking tests that are currently too flaky to consistently pass.

This bot retests PRs for certain kubernetes repos according to the following rules:

  • The PR does not have any do-not-merge/* labels
  • The PR does not have the needs-ok-to-test label
  • The PR is mergeable (does not have a needs-rebase label)
  • The PR is approved (has cncf-cla: yes, lgtm, approved labels)
  • The PR is failing tests required for merge

You can:

/retest

@k8s-ci-robot k8s-ci-robot merged commit 0c47669 into kubernetes:master Nov 17, 2021
@soltysh soltysh deleted the bump_burst branch November 17, 2021 09:38
@jonnylangefeld (Contributor) commented on Dec 10, 2021

I just updated to the latest kubectl version via brew and see no improvement over the previous behavior.

╰─ kubectl version
Client Version: version.Info{Major:"1", Minor:"23", GitVersion:"v1.23.0", GitCommit:"ab69524f795c42094a6630298ff53f3c3ebab7f4", GitTreeState:"clean", BuildDate:"2021-12-07T18:08:39Z", GoVersion:"go1.17.3", Compiler:"gc", Platform:"darwin/amd64"}

The commit ab69524 contains the change from this PR:

kubeConfigFlags = genericclioptions.NewConfigFlags(true).WithDeprecatedPasswordFlag().WithDiscoveryBurst(300).WithDiscoveryQPS(50.0)

After removing the local cache via rm -rf ~/.kube/cache/discovery/<HOST_IP>, I still get throttled

╰─ kubectl get pods
I1209 19:22:06.899491   29710 request.go:665] Waited for 1.190499503s due to client-side throttling, not priority and fairness, request: GET:https://10.216.1.114/apis/cloudscheduler.cnrm.cloud.google.com/v1beta1?timeout=32s
I1209 19:22:17.101035   29710 request.go:665] Waited for 11.391773142s due to client-side throttling, not priority and fairness, request: GET:https://10.216.1.114/apis/bigtable.cnrm.cloud.google.com/v1beta1?timeout=32s

This cluster has 295 CRDs and 125 group versions:

╰─ kubectl get crd | wc -l
     295

╰─ kubectl get crd -o json | jq -r '.items[].spec | "Group: " + .group + "; Version: " + .versions[].name' | sort | uniq | wc -l
     125

This jq query was first introduced in #101634 (comment) and adapted to account for all versions, not just the first, as mentioned in @lavalamp's comment #101634 (comment).

If the cache is not there, 170 GET requests are made. Once it is there, only 4 GET requests are made (3 of the 4 are to /apis/external.metrics.k8s.io/v1beta1).

╰─ kubectl get pods -v 8 2>&1 | grep "GET https" | wc -l
     170

╰─ kubectl get pods -v 8 2>&1 | grep "GET https" | wc -l
       4
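
As a rough back-of-envelope check (assuming the client here was still running with the older 100-request discovery burst and client-go's default 5 QPS, neither of which is stated in this comment): 170 uncached requests leaves roughly 70 requests beyond the burst, and 70 requests at 5 QPS is about 14 seconds of client-side waiting, the same order of magnitude as the ~11 second delay logged above.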

jonnylangefeld added a commit to jonnylangefeld/kubernetes that referenced this pull request Dec 20, 2021
This is a follow up to kubernetes#105520 which only changed the new default config flags in the `NewKubectlCommand` function if `kubeConfigFlags == nil`. However they are not nil because they were initialized before here:
https://github.com/kubernetes/kubernetes/blob/2fe968deb6cef4feea5bd0eb435e71844e397eed/staging/src/k8s.io/kubectl/pkg/cmd/cmd.go#L97

This fix uses the same defaults for both functions

Signed-off-by: Jonny Langefeld <jonny.langefeld@gmail.com>
@jonnylangefeld (Contributor) commented on Dec 20, 2021

I debugged this a bit and noticed that kubeConfigFlags is never nil (or at least not in a regular kubectl command) because it is already initialized here:

ConfigFlags: genericclioptions.NewConfigFlags(true).WithDeprecatedPasswordFlag(),

So this PR's change of adding .WithDiscoveryBurst(300).WithDiscoveryQPS(50.0) never took effect.

I created a fix with #107131. The results are evident: on the current master branch we still get the Waited for 1.190499503s due to client-side throttling messages, and with the fix they no longer appear. It's also a bit faster, since it no longer runs into the rate limiting. It still makes hundreds of unnecessary requests when the cache is invalid, but I opened a separate issue for that.
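
A simplified, hypothetical illustration of the guard described above (the function name below is made up; the real code lives in staging/src/k8s.io/kubectl/pkg/cmd/cmd.go): because the caller always hands in an already-initialized ConfigFlags, the nil branch that would have applied the higher discovery limits is never taken.

```go
// Hypothetical, simplified version of the pattern described in this comment:
// the new burst/QPS defaults sit behind a nil check that the regular kubectl
// call path never triggers.
package main

import "k8s.io/cli-runtime/pkg/genericclioptions"

// newDefaultFlags stands in for the guarded initialization this PR added.
func newDefaultFlags(kubeConfigFlags *genericclioptions.ConfigFlags) *genericclioptions.ConfigFlags {
	if kubeConfigFlags == nil {
		// Only reached when the caller passes nil, which (per the comment
		// above) the regular kubectl path never does, so the higher
		// discovery burst/QPS were silently skipped.
		kubeConfigFlags = genericclioptions.NewConfigFlags(true).
			WithDeprecatedPasswordFlag().
			WithDiscoveryBurst(300).
			WithDiscoveryQPS(50.0)
	}
	return kubeConfigFlags
}

func main() {
	// Mirrors the pre-existing initialization referenced above: the flags
	// arrive non-nil, so the guarded defaults never apply.
	preInitialized := genericclioptions.NewConfigFlags(true).WithDeprecatedPasswordFlag()
	_ = newDefaultFlags(preInitialized)
}
```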

ulucinar pushed a commit to ulucinar/kubernetes that referenced this pull request on Feb 28, 2022, with the same commit message as above.
YitzyD pushed two commits ("Bump discovery burst for kubectl to 300") to YitzyD/kubernetes that referenced this pull request on Mar 1, 2023.
Labels: approved, area/kubectl, cncf-cla: yes, kind/cleanup, lgtm, priority/backlog, release-note-none, sig/cli, size/S, triage/accepted
Linked issue: Discovery is throttled when there are lots of resources (CRDs)
10 participants