Fix to avoid REST API calls at log level 2. #40933

Merged
merged 1 commit into kubernetes:master from aveshagarwal:master-rhbz-1414813 on Apr 12, 2018

Conversation

@aveshagarwal
Member

aveshagarwal commented Feb 3, 2017

Logging REST API calls at log level 2 is too verbose (see below) when log level 2 happens to be the default, so this increases the level for these messages to 3.

I0203 12:37:27.059555   11483 wrap.go:75] PUT /api/v1/namespaces/kube-system/endpoints/kube-scheduler: (1.358954ms) 200 [[hyperkube/v0.0.0 (linux/amd64) kubernetes/$Format/leader-election] [::1]:35168]
I0203 12:37:27.780029   11483 wrap.go:75] GET /api/v1/nodes/127.0.0.1?resourceVersion=0: (819.444µs) 200 [[hyperkube/v0.0.0 (linux/amd64) kubernetes/$Format] [::1]:35154]
I0203 12:37:27.844846   11483 wrap.go:75] POST /api/v1/namespaces/default/events: (11.337447ms) 201 [[hyperkube/v0.0.0 (linux/amd64) kubernetes/$Format] [::1]:35154]
I0203 12:37:27.844851   11483 wrap.go:75] PATCH /api/v1/nodes/127.0.0.1/status: (9.998981ms) 200 [[hyperkube/v0.0.0 (linux/amd64) kubernetes/$Format] [::1]:35154]
I0203 12:37:28.942863   11483 wrap.go:75] GET /api/v1/namespaces/kube-system/endpoints/kube-controller-manager: (923.711µs) 200 [[hyperkube/v0.0.0 (linux/amd64) kubernetes/$Format/leader-election] [::1]:35156]
I0203 12:37:28.944556   11483 wrap.go:75] PUT /api/v1/namespaces/kube-system/endpoints/kube-controller-manager: (1.188942ms) 200 [[hyperkube/v0.0.0 (linux/amd64) kubernetes/$Format/leader-election] [::1]:35156]
I0203 12:37:29.061483   11483 wrap.go:75] GET /api/v1/namespaces/kube-system/endpoints/kube-scheduler: (937.549µs) 200 [[hyperkube/v0.0.0 (linux/amd64) kubernetes/$Format/leader-election] [::1]:35168]
I0203 12:37:29.063068   11483 wrap.go:75] PUT /api/v1/namespaces/kube-system/endpoints/kube-scheduler: (1.111312ms) 200 [[hyperkube/v0.0.0 (linux/amd64) kubernetes/$Format/leader-election] [::1]:35168]
I0203 12:37:30.947922   11483 wrap.go:75] GET /api/v1/namespaces/kube-system/endpoints/kube-controller-manager: (935.198µs) 200 [[hyperkube/v0.0.0 (linux/amd64) kubernetes/$Format/leader-election] [::1]:35156]
I0203 12:37:30.950150   11483 wrap.go:75] PUT /api/v1/namespaces/kube-system/endpoints/kube-controller-manager: (1.703438ms) 200 [[hyperkube/v0.0.0 (linux/amd64) kubernetes/$Format/leader-election] [::1]:35156]
I0203 12:37:31.064883   11483 wrap.go:75] GET /api/v1/namespaces/kube-system/endpoints/kube-scheduler: (1.127992ms) 200 [[hyperkube/v0.0.0 (linux/amd64) kubernetes/$Format/leader-election] [::1]:35168]
I0203 12:37:31.066503   11483 wrap.go:75] PUT /api/v1/namespaces/kube-system/endpoints/kube-scheduler: (1.098029ms) 200 [[hyperkube/v0.0.0 (linux/amd64) kubernetes/$Format/leader-election] [::1]:35168]
I0203 12:37:32.951691   11483 wrap.go:75] GET /api/v1/namespaces/kube-system/endpoints/kube-controller-manager: (945.295µs) 200 [[hyperkube/v0.0.0 (linux/amd64) kubernetes/$Format/leader-election] [::1]:35156]
I0203 12:37:32.953580   11483 wrap.go:75] PUT /api/v1/namespaces/kube-system/endpoints/kube-controller-manager: (1.331822ms) 200 [[hyperkube/v0.0.0 (linux/amd64) kubernetes/$Format/leader-election] [::1]:35156]
I0203 12:37:33.068221   11483 wrap.go:75] GET /api/v1/namespaces/kube-system/endpoints/kube-scheduler: (912.121µs) 200 [[hyperkube/v0.0.0 (linux/amd64) kubernetes/$Format/leader-election] [::1]:35168]
I0203 12:37:33.069787   11483 wrap.go:75] PUT /api/v1/namespaces/kube-system/endpoints/kube-scheduler: (1.120666ms) 200 [[hyperkube/v0.0.0 (linux/amd64) kubernetes/$Format/leader-election] [::1]:35168]
I0203 12:37:34.955546   11483 wrap.go:75] GET /api/v1/namespaces/kube-system/endpoints/kube-controller-manager: (1.02279ms) 200 [[hyperkube/v0.0.0 (linux/amd64) kubernetes/$Format/leader-election] [::1]:35156]
I0203 12:37:34.957812   11483 wrap.go:75] PUT /api/v1/namespaces/kube-system/endpoints/kube-controller-manager: (1.661017ms) 200 [[hyperkube/v0.0.0 (linux/amd64) kubernetes/$Format/leader-election] [::1]:35156]
I0203 12:37:35.071528   11483 wrap.go:75] GET /api/v1/namespaces/kube-system/endpoints/kube-scheduler: (935.155µs) 200 [[hyperkube/v0.0.0 (linux/amd64) kubernetes/$Format/leader-election] [::1]:35168]
I0203 12:37:35.073087   11483 wrap.go:75] PUT /api/v1/namespaces/kube-system/endpoints/kube-scheduler: (1.027371ms) 200 [[hyperkube/v0.0.0 (linux/amd64) kubernetes/$Format/leader-election] [::1]:35168]
I0203 12:37:35.580075   11483 wrap.go:75] GET /apis/extensions/v1beta1/thirdpartyresources: (943.453µs) 200 [[hyperkube/v0.0.0 (linux/amd64) kubernetes/$Format] [::1]:35144]
I0203 12:37:36.611659   11483 wrap.go:75] GET /api/v1/namespaces/default: (753.781µs) 200 [[hyperkube/v0.0.0 (linux/amd64) kubernetes/$Format] [::1]:35144]
I0203 12:37:36.612516   11483 wrap.go:75] GET /api/v1/namespaces/default/services/kubernetes: (495.105µs) 200 [[hyperkube/v0.0.0 (linux/amd64) kubernetes/$Format] [::1]:35144]
I0203 12:37:36.613167   11483 wrap.go:75] GET /api/v1/namespaces/default/endpoints/kubernetes: (379.568µs) 200 [[hyperkube/v0.0.0 (linux/amd64) kubernetes/$Format] [::1]:35144]
I0203 12:37:36.960131   11483 wrap.go:75] GET /api/v1/namespaces/kube-system/endpoints/kube-controller-manager: (1.431137ms) 200 [[hyperkube/v0.0.0 (linux/amd64) kubernetes/$Format/leader-election] [::1]:35156]
I0203 12:37:36.963470   11483 wrap.go:75] PUT /api/v1/namespaces/kube-system/endpoints/kube-controller-manager: (2.190438ms) 200 [[hyperkube/v0.0.0 (linux/amd64) kubernetes/$Format/leader-election] [::1]:35156]
I0203 12:37:37.028185   11483 wrap.go:75] GET /api/v1/nodes: (1.34149ms) 200 [[hyperkube/v0.0.0 (linux/amd64) kubernetes/$Format/pod-garbage-collector] [::1]:35156]
I0203 12:37:37.074666   11483 wrap.go:75] GET /api/v1/namespaces/kube-system/endpoints/kube-scheduler: (928.261µs) 200 [[hyperkube/v0.0.0 (linux/amd64) kubernetes/$Format/leader-election] [::1]:35168]
I0203 12:37:37.076314   11483 wrap.go:75] PUT /api/v1/namespaces/kube-system/endpoints/kube-scheduler: (1.240852ms) 200 [[hyperkube/v0.0.0 (linux/amd64) kubernetes/$Format/leader-election] [::1]:35168]
I0203 12:37:37.847163   11483 wrap.go:75] GET /api/v1/nodes/127.0.0.1?resourceVersion=0: (725.021µs) 200 [[hyperkube/v0.0.0 (linux/amd64) kubernetes/$Format] [::1]:35154]
I0203 12:37:37.901326   11483 wrap.go:75] PATCH /api/v1/nodes/127.0.0.1/status: (2.377445ms) 200 [[hyperkube/v0.0.0 (linux/amd64) kubernetes/$Format] [::1]:35154]
I0203 12:37:38.968028   11483 wrap.go:75] GET /api/v1/namespaces/kube-system/endpoints/kube-controller-manager: (3.777083ms) 200 [[hyperkube/v0.0.0 (linux/amd64) kubernetes/$Format/leader-election] [::1]:35156]
I0203 12:37:38.970313   11483 wrap.go:75] PUT /api/v1/namespaces/kube-system/endpoints/kube-controller-manager: (1.655815ms) 200 [[hyperkube/v0.0.0 (linux/amd64) kubernetes/$Format/leader-election] [::1]:35156]
I0203 12:37:39.077913   11483 wrap.go:75] GET /api/v1/namespaces/kube-system/endpoints/kube-scheduler: (919.65µs) 200 [[hyperkube/v0.0.0 (linux/amd64) kubernetes/$Format/leader-election] [::1]:35168]
I0203 12:37:39.079617   11483 wrap.go:75] PUT /api/v1/namespaces/kube-system/endpoints/kube-scheduler: (1.196855ms) 200 [[hyperkube/v0.0.0 (linux/amd64) kubernetes/$Format/leader-election] [::1]:35168]
I0203 12:37:40.972171   11483 wrap.go:75] GET /api/v1/namespaces/kube-system/endpoints/kube-controller-manager: (936.07µs) 200 [[hyperkube/v0.0.0 (linux/amd64) kubernetes/$Format/leader-election] [::1]:35156]
I0203 12:37:40.973886   11483 wrap.go:75] PUT /api/v1/namespaces/kube-system/endpoints/kube-controller-manager: (1.200048ms) 200 [[hyperkube/v0.0.0 (linux/amd64) kubernetes/$Format/leader-election] [::1]:35156]
I0203 12:37:41.084158   11483 wrap.go:75] GET /api/v1/namespaces/kube-system/endpoints/kube-scheduler: (3.842758ms) 200 [[hyperkube/v0.0.0 (linux/amd64) kubernetes/$Format/leader-election] [::1]:35168]
I0203 12:37:41.085722   11483 wrap.go:75] PUT /api/v1/namespaces/kube-system/endpoints/kube-scheduler: (1.101371ms) 200 [[hyperkube/v0.0.0 (linux/amd64) kubernetes/$Format/leader-election] [
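
The change itself is small. Below is a minimal sketch of what it amounts to, assuming the per-request line comes from the apiserver's request-wrapping log code (wrap.go in the output above); the package, function, and parameter names here are illustrative, not the actual Kubernetes source:

```go
// Illustrative sketch only: the fix amounts to gating the per-request log
// line behind glog.V(3) instead of glog.V(2).
package httplog

import (
	"net/http"
	"time"

	"github.com/golang/glog"
)

// logRequest emits one line per completed REST call, e.g.
//   PUT /api/v1/namespaces/kube-system/endpoints/kube-scheduler: (1.358954ms) 200 [...]
// The function and parameter names here are hypothetical.
func logRequest(req *http.Request, status int, elapsed time.Duration, userAgent, remoteAddr string) {
	// Before: glog.V(2).Infof(...) - one entry per API call at the default level.
	// After:  glog.V(3).Infof(...) - clusters running at --v=2 no longer see this line.
	glog.V(3).Infof("%s %s: (%v) %d [[%s] %s]",
		req.Method, req.RequestURI, elapsed, status, userAgent, remoteAddr)
}
```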

xref: https://bugzilla.redhat.com/show_bug.cgi?id=1414813

@kubernetes/rh-cluster-infra

Fix #47916

@k8s-reviewable

k8s-reviewable commented Feb 3, 2017

This change is Reviewable

@eparis

Member

eparis commented Feb 3, 2017

was this a change since 1.4?

@eparis

Member

eparis commented Feb 3, 2017

@smarterclayton looks like you originally picked v(2) way back in 4e56daf

#1359

Who would need to approve moving to V(3)

@smarterclayton

Contributor

smarterclayton commented Feb 3, 2017

@lavalamp because we talked about this being an operational tool for normal function, and I suspect people rely on it in GKE and others at v2 today to get logging. Changing to v3 would break anyone who scrapes at v2.

That said, I agree this is very verbose.

@eparis

Member

eparis commented Feb 3, 2017

on a cluster of any size doing work it seems to be swampingly verbose :) whereas the other v(2) stuff isn't so bad...

@smarterclayton

Contributor

smarterclayton commented Feb 3, 2017

My gut would be that it should be easy to opt out of this and that most people most of the time don't need this (which is what 3 was for). But because it has operational impact...

@gmarek

Member

gmarek commented Feb 6, 2017

cc @wojtek-t

I'm leaning against this PR. Those are probably the most useful logs in the whole system - I look at them pretty much every time I'm debugging anything. I agree that they are verbose, but they're not meant for direct human consumption; they're meant to be consumed with the power of grep.

@aveshagarwal - why do you want to make this change? Is the size of the logs the problem?

@aveshagarwal

Member

aveshagarwal commented Feb 6, 2017

@gmarek I agree these are useful logs, but the question is do we want these logs at --v=2?

@gmarek

Member

gmarek commented Feb 6, 2017

I think that we do. Logs are there to help figure out what's wrong, and those logs are extremely important for that. If someone wants no logging (aka 'I don't care'), there's always V=0.

@smarterclayton

Contributor

smarterclayton commented Feb 6, 2017

@gmarek

Member

gmarek commented Feb 7, 2017

I do think that 'every (sane) user wants them', but that's not really a formal definition:)

@gmarek

Member

gmarek commented Feb 8, 2017

In other words - I think we should close this PR.

@mffiedler

mffiedler commented Feb 8, 2017

Level 2 usually provides enough info for devs to start the debug process. Having that level sane really helps when difficult-to-repro problems are encountered.

@smarterclayton

Contributor

smarterclayton commented Feb 8, 2017

@gmarek

Member

gmarek commented Feb 23, 2017

I'm closing this PR. Ping me if you disagree.

@gmarek gmarek closed this Feb 23, 2017

@ncdc

Member

ncdc commented Apr 12, 2017

@gmarek we have been discussing this again today. I agree with you that these logs are useful when debugging issues. However, they have a negative performance impact when running large clusters. Would you think it's reasonable to proceed with this, with the expectation that for production clusters, admins would increase the loglevel from 2 to 3 or 4 if they need to debug issues? Otherwise, 2 would be an expected default for informational log messages but not spammy ones. Also, for PR jobs, we should be running at 3 or 4 if we're not already. WDYT?

@ncdc ncdc referenced this pull request Apr 12, 2017

Closed

REST spam #13724

@gmarek

Member

gmarek commented Apr 12, 2017

No, level 3 is already too verbose. I understand the performance implications, but maybe we could get past them by dealing with the --logtostderr flag, which is most likely the main culprit (unbuffered writes)? I don't know what (if anything) the golang team is doing with the glog package, so I don't know if it's possible...

Also, this is just my (admittedly very strong) opinion, so if enough important people agree that we should do that, then I won't be able to block this (though honestly - I'm pretty convinced that it'll cost you more debugging time than you'll save on CPU).

Can't you run the cluster with --v=1?

@wojtek-t @smarterclayton @thockin @lavalamp
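
To make the "unbuffered writes" idea above concrete, here is a hedged sketch of what buffering log output could look like in general: collect lines in a buffer and flush periodically rather than issuing one stderr write per line. This is illustrative only; it is not glog's implementation, and the premise that unbuffered stderr writes are the main cost is the assumption from the comment above.

```go
// Hypothetical illustration, not glog internals: buffer log lines and flush
// periodically, instead of one unbuffered write to stderr per line.
package main

import (
	"bufio"
	"os"
	"sync"
	"time"
)

type bufferedLogger struct {
	mu sync.Mutex
	w  *bufio.Writer
}

func newBufferedLogger() *bufferedLogger {
	l := &bufferedLogger{w: bufio.NewWriterSize(os.Stderr, 256<<10)}
	go func() {
		// Flush every few seconds so lines still appear reasonably promptly.
		for range time.Tick(3 * time.Second) {
			l.mu.Lock()
			l.w.Flush()
			l.mu.Unlock()
		}
	}()
	return l
}

func (l *bufferedLogger) Println(line string) {
	l.mu.Lock()
	defer l.mu.Unlock()
	l.w.WriteString(line + "\n")
}

func main() {
	log := newBufferedLogger()
	log.Println("I0203 12:37:27.059555 wrap.go:75] PUT /api/v1/... (1.3ms) 200")
	time.Sleep(4 * time.Second) // let the background flusher run in this toy example
}
```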

@ncdc

Member

ncdc commented Apr 12, 2017

@gmarek if we run the cluster at --v=1 and run into an issue, we almost always have to bump up to --v=2 (or higher) to truly debug. I suppose another option is to move the non-spammy messages we care about that are currently at 2 down to 1. But from what I've heard from @smarterclayton, we currently want to run at 2 by default. We are trying to eliminate overly spammy messages from 2.

My suggestion is that we move this to 3, run clusters at 2, and if there is an issue that needs debugging, you restart the components at 3 and try to reproduce the issue, if that's acceptable.

@gmarek

Member

gmarek commented Apr 12, 2017

The thing is that you never know when an issue will surface, and therefore when to do this.

@eparis

Member

eparis commented Apr 12, 2017

For the most part I feel like v(2) has informational content about a cluster that an admin cares about, while v(3) has engineering debug stuff. This is CLEARLY engineering debug stuff. Practically no admin wants to be spammed with these. Real-life example: on a small cluster (7 nodes with 190 pods) we see over 82,000 messages per hour from this, and fewer than 4,000 messages per hour from everything else.

This one source of spam accounts for roughly 20x more messages than the rest of V(2) put together.

@eparis eparis reopened this Apr 12, 2017

@ncdc

Member

ncdc commented Apr 12, 2017

@gmarek it's a balance, and it's one that's hard to win :-) You either spam the cluster with logs that you don't need 99% of the time, or you don't and it's harder to debug the past.

@k8s-ci-robot k8s-ci-robot removed the keep-open label Dec 15, 2017

@brendandburns

Contributor

brendandburns commented Jan 3, 2018

@gmarek @eparis do we have a conclusion here?

@cblecker

Member

cblecker commented Feb 10, 2018

This appears to be somewhat active. Removing the lifecycle/frozen state.

/remove-lifecycle frozen

@liggitt

Member

liggitt commented Mar 26, 2018

> But IIUC it'll contain exactly the same information you're complaining about, so the problem will not go away, it'll just change the name.

audit lets you select some resources and not others, and allows for async/batch processing to avoid bottlenecks

I'm in favor of this change

@soltysh

Contributor

soltysh commented Apr 6, 2018

/refresh

@cblecker

Member

cblecker commented Apr 6, 2018

/test all

@sjenning

Contributor

sjenning commented Apr 11, 2018

/lgtm

@k8s-ci-robot

Contributor

k8s-ci-robot commented Apr 11, 2018

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: aveshagarwal, brendandburns, sjenning

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@kubernetes kubernetes deleted a comment from k8s-merge-robot Apr 11, 2018

@fejta-bot

fejta-bot commented Apr 11, 2018

/retest
This bot automatically retries jobs that failed/flaked on approved PRs (send feedback to fejta).

Review the full test history for this PR.

Silence the bot with an /lgtm cancel comment for consistent failures.

@fejta-bot

fejta-bot commented Apr 12, 2018

/retest

@k8s-merge-robot

Contributor

k8s-merge-robot commented Apr 12, 2018

/test all [submit-queue is verifying that this PR is safe to merge]

@k8s-ci-robot

Contributor

k8s-ci-robot commented Apr 12, 2018

@aveshagarwal: The following test failed, say /retest to rerun them all:

Test name: pull-kubernetes-kubemark-e2e-gce
Commit: 6a5c248
Rerun command: /test pull-kubernetes-kubemark-e2e-gce

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@k8s-merge-robot

Contributor

k8s-merge-robot commented Apr 12, 2018

Automatic merge from submit-queue (batch tested with PRs 40933, 62188). If you want to cherry-pick this change to another branch, please follow the instructions here.

@k8s-merge-robot k8s-merge-robot merged commit 2b854df into kubernetes:master Apr 12, 2018

18 of 20 checks passed

pull-kubernetes-kubemark-e2e-gce: Job failed.
Submit Queue: Required Github CI test is not green: pull-kubernetes-kubemark-e2e-gce
cla/linuxfoundation: aveshagarwal authorized
pull-kubernetes-bazel: Job succeeded.
pull-kubernetes-bazel-build: Job succeeded.
pull-kubernetes-bazel-test: Job succeeded.
pull-kubernetes-cross: Skipped
pull-kubernetes-e2e-gce: Job succeeded.
pull-kubernetes-e2e-gce-device-plugin-gpu: Job succeeded.
pull-kubernetes-e2e-gce-etcd3: Jenkins job succeeded.
pull-kubernetes-e2e-gce-gpu: Jenkins job succeeded.
pull-kubernetes-e2e-gke: Skipped
pull-kubernetes-e2e-kops-aws: Job succeeded.
pull-kubernetes-federation-e2e-gce: Jenkins job succeeded.
pull-kubernetes-integration: Job succeeded.
pull-kubernetes-local-e2e: Skipped
pull-kubernetes-node-e2e: Job succeeded.
pull-kubernetes-typecheck: Job succeeded.
pull-kubernetes-unit: Context retired. Status moved to "pull-kubernetes-integration".
pull-kubernetes-verify: Job succeeded.

@lavalamp

Member

lavalamp commented Apr 12, 2018

....I was writing a comment, but the bot merged this, apparently on a phantom approval.

I think changing this deserved publication on sigs for people who run clusters. [edit: assuming this didn't happen already]

@wojtek-t

Member

wojtek-t commented Apr 12, 2018

@shyamjvs @kubernetes/sig-scalability-bugs

I don't think we should have merged that - this will significantly affect our debugging ability (at least in the context of performance tests, where we can't really run tests at higher verbosity, since the amount of logs there overloads the apiserver).

So I think we may consider reverting it, and making verbosity for those logs a flag (I'm fine with having a default still at 3, but I would like to be able to modify that for the purpose of scalability tests).

@wojtek-t

Member

wojtek-t commented Apr 12, 2018

I've opened #62468 to not forget about it.

@shyamjvs

Member

shyamjvs commented Apr 12, 2018

Sorry, I missed this earlier. Definitely agree with @wojtek-t on how this can greatly hurt debuggability, especially with respect to performance/scalability issues.
I'd vote for 'revert + introduce flag' too.

@ncdc

Member

ncdc commented Apr 12, 2018

@shyamjvs please see the -vmodule comments in #62468. We probably don't need to revert this if it's acceptable to adjust the scalability logging flags.

@eparis

Member

eparis commented Apr 12, 2018

I think Andy is right and this PR is right: -vmodule can make these logs available but cannot be used to hide them. vmodule only goes in one direction.
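
A simplified model of why -vmodule "only goes in one direction": glog's V() check consults the global -v level first, so a --vmodule entry (for example, something like --v=2 --vmodule=wrap=3; the exact flag values here are illustrative) can raise the effective verbosity for a matching file but can never lower it below the global level. The sketch below is a simplification, not glog's actual code:

```go
// Simplified model of glog's V() decision, to show the one-directional nature
// of --vmodule. Not the real implementation.
package glogsketch

// vEnabled reports whether a log line at the given level would be emitted for
// a file, given the global -v value and any --vmodule override for that file.
func vEnabled(level, globalV int, vmoduleLevel int, vmoduleMatches bool) bool {
	if globalV >= level {
		// The global -v already allows it; a vmodule entry cannot take it away.
		return true
	}
	if vmoduleMatches && vmoduleLevel >= level {
		// vmodule can only add verbosity for files it matches (e.g. wrap.go).
		return true
	}
	return false
}
```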

@shyamjvs

Member

shyamjvs commented Apr 13, 2018

I've sent a PR for enabling the logs through -vmodule for tests.
However, there's still a discussion about whether it should be the prod default, under #62468 (comment).

@aveshagarwal aveshagarwal deleted the aveshagarwal:master-rhbz-1414813 branch Aug 3, 2018
