
client-go: do not print the request latency logs at level 0 by default #101634

Closed
wants to merge 1 commit

Conversation

SataQiu
Member

@SataQiu SataQiu commented Apr 30, 2021

What type of PR is this?

/kind bug

What this PR does / why we need it:

Printing the latency logs by default will break the user's original behavior, especially if they are doing something with the command output.
IMO, we should print this message at a higher verbosity level, not at the default level of 0.

This PR:

  1. Print the request latency logs only when log level 2 is enabled
  2. Add more test cases for throttledLogger

Which issue(s) this PR fixes:

Fixes #101631
Ref #88134 (before that PR, we printed the log at level 2)

Special notes for your reviewer:

Does this PR introduce a user-facing change?

client-go: do not print the request latency logs at level 0 by default

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot added the release-note, kind/bug, size/XS, cncf-cla: yes, do-not-merge/needs-sig, needs-triage, and needs-priority labels on Apr 30, 2021
@SataQiu
Member Author

SataQiu commented Apr 30, 2021

/priority important-soon
/sig api-machinery

@k8s-ci-robot added the priority/important-soon label and removed the needs-priority label on Apr 30, 2021
@SataQiu
Member Author

SataQiu commented Apr 30, 2021

/sig instrumentation

@k8s-ci-robot added the sig/instrumentation label and removed the do-not-merge/needs-sig label on Apr 30, 2021
@SataQiu
Member Author

SataQiu commented Apr 30, 2021

/cc @logicalhan

@k8s-ci-robot added the size/M label and removed the size/XS label on Apr 30, 2021
@SataQiu
Member Author

SataQiu commented Apr 30, 2021

/test pull-kubernetes-e2e-gce-ubuntu-containerd

Member

@logicalhan logicalhan left a comment

I'm confused by this PR. From what I can tell, you are changing a throttling setting, which shouldn't affect the level at which request latency logs are printed. This change, IIUC, should just mean that instead of throttling logs at V(0) we are throttling logs at V(1).

@jonnylangefeld
Contributor

jonnylangefeld commented May 4, 2021

@logicalhan This change doesn't look like it's changing the throttling setting, but rather the log level at which the globalThrottledLogger logs.
This change
https://github.com/kubernetes/kubernetes/blob/f7bd79539240c57bd1a99795526880914e9e228b/staging/src/k8s.io/client-go/rest/request.go#L636
influences which logLevel is used for this line
https://github.com/kubernetes/kubernetes/blob/f7bd79539240c57bd1a99795526880914e9e228b/staging/src/k8s.io/client-go/rest/request.go#L668
And the origin of the log message is ultimately here:
https://github.com/kubernetes/kubernetes/blob/f7bd79539240c57bd1a99795526880914e9e228b/staging/src/k8s.io/client-go/rest/request.go#L605

And it's true, this was last changed in PR #88134, which @SataQiu linked. One commit before that, the general log level for the globalThrottledLogger was set to 2 here:
https://github.com/jennybuckley/kubernetes/blob/0dfe0c793b5b85bd85cec7b77679ba78c1617357/staging/src/k8s.io/client-go/rest/request.go#L559

Since I updated my kubectl build to include #88134, I'm getting a lot of logs of the form

Waited for 1.140352693s due to client-side throttling, not priority and fairness, request: GET:https://10.212.0.242/apis/storage.k8s.io/v1beta1?timeout=32s

when querying any resource.

Hope this explanation helps. I give this a +1 to merge because this info is not useful for most API clients, and for the ones that need a bit more verbosity, it will still show up when verbosity is set to 1.
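
To illustrate the pattern described above, here is a heavily simplified sketch of a throttled logger with per-level settings. This is not the actual client-go source; the type names, the values in the settings table, and the small demo in main are illustrative only (the PR under discussion changes which logLevel values appear in such a table).

package main

import (
    "sync"
    "time"

    "k8s.io/klog/v2"
)

// throttleSetting pairs a klog verbosity level with the minimum interval
// between messages emitted at that level.
type throttleSetting struct {
    logLevel       klog.Level
    minLogInterval time.Duration

    lastLogTime time.Time
    lock        sync.Mutex
}

// throttledLogger emits a message through the first setting whose verbosity
// is currently enabled, and only if that setting's interval has elapsed.
type throttledLogger struct {
    settings []*throttleSetting
}

func (t *throttledLogger) Infof(format string, args ...interface{}) {
    for _, s := range t.settings {
        if !klog.V(s.logLevel).Enabled() {
            continue
        }
        s.lock.Lock()
        due := time.Since(s.lastLogTime) >= s.minLogInterval
        if due {
            s.lastLogTime = time.Now()
        }
        s.lock.Unlock()
        if due {
            klog.V(s.logLevel).Infof(format, args...)
        }
        return // only the first enabled setting is consulted
    }
}

func main() {
    defer klog.Flush()
    // Illustrative settings: a frequent message at -v=2, a heavily
    // rate-limited fallback at the default verbosity.
    logger := &throttledLogger{settings: []*throttleSetting{
        {logLevel: 2, minLogInterval: 1 * time.Second},
        {logLevel: 0, minLogInterval: 10 * time.Second},
    }}
    logger.Infof("Waited for %v due to client-side throttling, request: %s",
        1140*time.Millisecond, "GET:https://example.invalid/apis/storage.k8s.io/v1beta1")
}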

@caesarxuchao
Member

/triage accept
/assign @logicalhan

@k8s-ci-robot
Contributor

@caesarxuchao: The label(s) triage/accept cannot be applied, because the repository doesn't have them.

In response to this:

/triage accept
/assign @logicalhan

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@caesarxuchao
Member

/triage accepted

@k8s-ci-robot added the triage/accepted label and removed the needs-triage label on May 4, 2021
@SataQiu SataQiu requested a review from logicalhan May 7, 2021 09:47
Member

@logicalhan logicalhan left a comment

/lgtm
/approve

@k8s-ci-robot added the lgtm label on May 7, 2021
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: logicalhan, SataQiu
To complete the pull request process, please assign yliaog after the PR has been reviewed.
You can assign the PR to them by writing /assign @yliaog in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@SataQiu
Member Author

SataQiu commented May 8, 2021

/assign @liggitt

@liggitt
Member

liggitt commented May 10, 2021

/unassign
/assign @lavalamp

looks like #88134 cited #87740 (comment) as the reason for the higher log level

@k8s-ci-robot k8s-ci-robot assigned lavalamp and unassigned liggitt May 10, 2021
@liggitt
Member

liggitt commented May 10, 2021

Printing the latency logs by default will break the user's original behavior, especially if they are doing something with the command output.

It prints to stderr, right? There are many things that can print to stderr; I would not expect structured output to be passed via that output

@jonnylangefeld
Contributor

@liggitt Maybe V(1) and V(2) would be appropriate log levels (instead of V(0) and V(2)).

@caesarxuchao
Member

cc @deads2k

@deads2k
Contributor

deads2k commented May 11, 2021

TBH, the messages have been useful to us at their current level because component authors find them and wonder why they see them. I'm biased towards keeping delays of over one second as info level messages.

@SataQiu do you have examples of healthy components (no hot or warm loops) with properly configured client-side rate limits seeing this message unnecessarily? It seems like these messages at an info level would be helpful, not harmful.

@SataQiu
Member Author

SataQiu commented May 13, 2021

Printing the latency logs by default will break the user's original behavior, especially if they are doing something with the command output.

It prints to stderr, right? There are many things that can print to stderr, I would not expect structured output to be passed via that output

Emm...
As @liggitt said, there is a workaround (such as kubectl get pod 2>/dev/null) to shut out the additional output.

/close

@k8s-ci-robot
Contributor

@SataQiu: Closed this PR.

In response to this:

Printing the latency logs by default will break the user's original behavior, especially if they are doing something with the command output.

It prints to stderr, right? There are many things that can print to stderr; I would not expect structured output to be passed via that output

Emm...
As @liggitt said, there is a workaround (such as kubectl get pod 2>/dev/null) to shut out the additional output.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@lavalamp
Member

My guess is that most people don't know they are using client-side rate limits, and on average printing this log message saves 10 hours of debugging for every hour it costs someone trying to hide it.

@jonnylangefeld
Contributor

jonnylangefeld commented Oct 2, 2021

@lavalamp since everyone finds these messages helpful, maybe I'm just interpreting them wrong, and rather than wanting to hide them I should figure out what causes them.
What could I do, as a platform owner, to eliminate the cause of this client-side throttling?

Edit: I know that our clusters have many CRDs (mainly through cloud vendor provisioners like Google Config Connector), and I know it has to do with the number of CRDs and the discovery cache that Kubernetes keeps around. But I don't know how to improve the situation or how to avoid constantly running into rate limiting...

$ kubectl get crd | awk 'NR>1' | wc -l
     225

@lavalamp
Member

lavalamp commented Oct 4, 2021

@jonnylangefeld In code you can modify the rest.Config struct fields QPS and Burst. -1 disables, 0 chooses a default (50 qps IIRC). Now that APF is defaulted on, I think most users can just disable client-side limits and let the server worry about it (the exception is if there are many watchers of the resources you are modifying, e.g. endpoints).

We were going to add these to kubeconfig but didn't because it's only applicable to the standard go client: kubernetes/enhancements#1630

Some binaries (kube-controller-manager) might expose command line flags for these.
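
As a concrete illustration of the above, here is a minimal sketch of building a client with relaxed client-side rate limits via the rest.Config QPS and Burst fields. The kubeconfig path and the specific QPS/Burst values are placeholders, and the note about -1 reflects lavalamp's comment above rather than anything verified here.

package main

import (
    "context"
    "fmt"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/tools/clientcmd"
)

func main() {
    // Placeholder kubeconfig path; adjust for your environment.
    cfg, err := clientcmd.BuildConfigFromFlags("", "/path/to/kubeconfig")
    if err != nil {
        panic(err)
    }

    // Relax the client-side rate limit. 0 keeps the client-go default;
    // per the comment above, -1 should disable client-side limiting entirely.
    cfg.QPS = 100
    cfg.Burst = 200

    clientset, err := kubernetes.NewForConfig(cfg)
    if err != nil {
        panic(err)
    }

    pods, err := clientset.CoreV1().Pods("default").List(context.TODO(), metav1.ListOptions{})
    if err != nil {
        panic(err)
    }
    fmt.Printf("listed %d pods\n", len(pods.Items))
}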

@jonnylangefeld
Contributor

Thanks for the explanation! Hmm, so for kubectl that would mean we maintain our own fork and build it ourselves?

@lavalamp
Member

lavalamp commented Oct 4, 2021

How are you hitting this limit with kubectl?

@jonnylangefeld
Contributor

It seems like it has to do with the discovery cache. If I run kubectl get pod -v 8, I see hundreds of requests being made. They're all of this sort:

GET https://<host>/apis/dns.cnrm.cloud.google.com/v1beta1?timeout=32s
GET https://<host>/apis/templates.gatekeeper.sh/v1alpha1?timeout=32s
GET https://<host>/apis/firestore.cnrm.cloud.google.com/v1beta1?timeout=32s

So these are requests for all the custom resources we have installed on the cluster.
Only the very last request is for the actual pods:

GET https://10.216.64.18/api/v1/namespaces/default/pods?limit=500

I think the many GET requests trigger the client-side throttling.
This Reddit post explains the discovery cache behavior, but not how it can be circumvented or at least kept from being rate limited.

@lavalamp
Member

lavalamp commented Oct 4, 2021

OK, that was going to be my guess. How many CRDs do you have? You should need over 100 group versions to trigger this.

@jonnylangefeld
Contributor

Yeah, we have a bunch. As mentioned in my comment above, it's over 200 individual CRDs:

$ kubectl get crd | awk 'NR>1' | wc -l
     225

And since you said only distinct group versions matter, I ran the following:

$ kubectl get crd -o json | jq -r '.items[] | "Group: " + .spec.group + "; Version: " + .spec.versions[0].name' | sort | uniq | wc -l
     72

So not quite 100 group versions, but close.
We reach that fairly quickly with Istio, Google's Config Connector, Gatekeeper, Velero, cert-manager, and a few in-house controllers.

Now the question would be: what options do we have to either extend the validity of the cache (so kubectl can rely on it longer) or, maybe even better, only rebuild the cache when I'm querying for a resource that's not already in the discovery cache?
Do you think that would be feasible?

@lavalamp
Member

lavalamp commented Oct 5, 2021

That jq query is undercounting (need to emit a name for every version, not just the first).

I have filed an issue about this since no one is going to see this buried here :) #105489

Thanks for the report.

Labels
cncf-cla: yes - Indicates the PR's author has signed the CNCF CLA.
kind/bug - Categorizes issue or PR as related to a bug.
lgtm - "Looks good to me", indicates that a PR is ready to be merged.
priority/important-soon - Must be staffed and worked on either currently, or very soon, ideally in time for the next release.
release-note - Denotes a PR that will be considered when it comes time to generate release notes.
sig/api-machinery - Categorizes an issue or PR as relevant to SIG API Machinery.
sig/instrumentation - Categorizes an issue or PR as relevant to SIG Instrumentation.
size/M - Denotes a PR that changes 30-99 lines, ignoring generated files.
triage/accepted - Indicates an issue or PR is ready to be actively worked on.
Development

Successfully merging this pull request may close these issues.

kubectl api returns Throttling request took message
8 participants