Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix double counting issue for request metrics on timeout. #83427

Merged
merged 3 commits into from Oct 22, 2019

Conversation

@logicalhan
Copy link
Contributor

logicalhan commented Oct 2, 2019

Currently we record request metrics during the normal request flow; we also manually invoke Record in the timeout handler to record timeouts. This means that we double count requests to our metrics whenever we timeout. To preserve the bits where we do encounter an error but to avoid double-counting our metrics, this PR adds a new metric to represent the running count of apiserver request errors and records occurrences of encountered errors to that new metrics.

This PR also renames the Record function to RecordRequestError to more accurately reflect the intended side-effect of the function call.

What type of PR is this?

/kind bug

Which issue(s) this PR fixes:

Fixes #83424

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

Adds a metric apiserver_request_error_total to kube-apiserver. This metric tallies the number of request_errors encountered by verb, group, version, resource, subresource, scope, component, and code. 
Currently we record request metrics during the normal request flow and
we also manually invoke `Record` in the timeout handler to record
timeouts. This means that we effectively double count whenever we
timeout. This PR renames the `Record` function to `RecordRequestError`
to more accurately reflect the intended side-effect of the function
call.

Change-Id: Ie37fd0c1e501bd525640a434433d364a5fd6dde2
@logicalhan

This comment has been minimized.

Copy link
Contributor Author

logicalhan commented Oct 2, 2019

/sig api-machinery instrumentation

@k8s-ci-robot

This comment has been minimized.

Copy link
Contributor

k8s-ci-robot commented Oct 2, 2019

@logicalhan: The label(s) priority/importent-soon cannot be applied. These labels are supported: api-review, community/discussion, community/maintenance, community/question, cuj/build-train-deploy, cuj/multi-user, platform/aws, platform/azure, platform/gcp, platform/minikube, platform/other

In response to this:

/sig api-machinery instrumentation
/priority importent-soon

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@logicalhan

This comment has been minimized.

Copy link
Contributor Author

logicalhan commented Oct 2, 2019

/priority important-soon

@logicalhan logicalhan force-pushed the logicalhan:metrics-timeout branch from 4c6e724 to 194aed7 Oct 3, 2019
@logicalhan

This comment has been minimized.

Copy link
Contributor Author

logicalhan commented Oct 3, 2019

/retest

@logicalhan

This comment has been minimized.

Copy link
Contributor Author

logicalhan commented Oct 6, 2019

For ref: #79609


requestErrorTotal = compbasemetrics.NewCounterVec(
&compbasemetrics.CounterOpts{
Name: "apiserver_request_error_total",

This comment has been minimized.

Copy link
@cheftako

cheftako Oct 7, 2019

Member

Pedantic: error should probably be pluralized.

Copy link
Member

cheftako left a comment

/lgtm
Please take a look at the metric name as I would rather not change it after this is submitted.

@k8s-ci-robot k8s-ci-robot added the lgtm label Oct 7, 2019
@logicalhan logicalhan force-pushed the logicalhan:metrics-timeout branch from 194aed7 to 9ba8ab3 Oct 8, 2019
@k8s-ci-robot k8s-ci-robot removed the lgtm label Oct 8, 2019
@cheftako

This comment has been minimized.

Copy link
Member

cheftako commented Oct 9, 2019

Will lgtm after gofmt.

Change-Id: I12eb94f41ded20ed5a16332ada13a7b34f75de18
@logicalhan logicalhan force-pushed the logicalhan:metrics-timeout branch from 9ba8ab3 to 5e652fe Oct 9, 2019
…ding documentation

Change-Id: I47a9c7b10614afe85bb652fa61984f91848d6d65
@k8s-ci-robot k8s-ci-robot added size/M and removed lgtm size/S labels Oct 21, 2019
@logicalhan

This comment has been minimized.

Copy link
Contributor Author

logicalhan commented Oct 21, 2019

/retest

@lavalamp

This comment has been minimized.

Copy link
Member

lavalamp commented Oct 21, 2019

/lgtm

@lavalamp

This comment has been minimized.

Copy link
Member

lavalamp commented Oct 21, 2019

/approve

@k8s-ci-robot

This comment has been minimized.

Copy link
Contributor

k8s-ci-robot commented Oct 21, 2019

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: lavalamp, logicalhan

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@lavalamp

This comment has been minimized.

Copy link
Member

lavalamp commented Oct 21, 2019

If it is easy, I'd like to cherry-pick this back. This is a major defect in our metrics.

@k8s-ci-robot k8s-ci-robot merged commit aa25739 into kubernetes:master Oct 22, 2019
14 of 15 checks passed
14 of 15 checks passed
pull-kubernetes-kubemark-e2e-gce-big Job triggered.
Details
cla/linuxfoundation logicalhan authorized
Details
pull-kubernetes-bazel-build Job succeeded.
Details
pull-kubernetes-bazel-test Job succeeded.
Details
pull-kubernetes-dependencies Job succeeded.
Details
pull-kubernetes-e2e-gce Job succeeded.
Details
pull-kubernetes-e2e-gce-100-performance Job succeeded.
Details
pull-kubernetes-e2e-gce-device-plugin-gpu Job succeeded.
Details
pull-kubernetes-e2e-kind Job succeeded.
Details
pull-kubernetes-integration Job succeeded.
Details
pull-kubernetes-node-e2e Job succeeded.
Details
pull-kubernetes-node-e2e-containerd Job succeeded.
Details
pull-kubernetes-typecheck Job succeeded.
Details
pull-kubernetes-verify Job succeeded.
Details
tide In merge pool.
Details
@k8s-ci-robot k8s-ci-robot added this to the v1.17 milestone Oct 22, 2019
@RainbowMango

This comment has been minimized.

Copy link
Member

RainbowMango commented Oct 22, 2019

If it is easy, I'd like to cherry-pick this back. This is a major defect in our metrics.

@lavalamp
Which branches do you want to cherry-pick to? Can I try the cherry-pick process?

@lavalamp

This comment has been minimized.

Copy link
Member

lavalamp commented Oct 22, 2019

@RainbowMango please feel free!

I'd take this back as many versions as we support, 1.14, 1.15, 1.16, and 1.13 if it's still in the support window. Not sure how easily this will cherry pick to the older ones...

@tpepper

This comment has been minimized.

Copy link
Contributor

tpepper commented Nov 4, 2019

@lavalamp @RainbowMango are there going to be backports on this for 1.15 and 1.14?

k8s-ci-robot added a commit that referenced this pull request Nov 4, 2019
…#83427-upstream-release-1.16

Automated cherry pick of #83427: Fix double counting issue for request metrics on timeout.
@RainbowMango

This comment has been minimized.

Copy link
Member

RainbowMango commented Nov 5, 2019

@lavalamp @RainbowMango are there going to be backports on this for 1.15 and 1.14?

Oh, yes. I will try this as soon as possible. As mentioned by @logicalhan , we will probably need a manual PR to pick this fix back to 1.15 and 1.14.

According to kubernetes version policy, I think 1.13 is no longer maintained, so it wouldn't happen on 1.13.

The Kubernetes project maintains release branches for the most recent three minor releases.

RainbowMango added a commit to RainbowMango/kubernetes that referenced this pull request Nov 5, 2019
Fix double counting issue for request metrics on timeout.
RainbowMango added a commit to RainbowMango/kubernetes that referenced this pull request Nov 5, 2019
Fix double counting issue for request metrics on timeout.
k8s-ci-robot added a commit that referenced this pull request Nov 7, 2019
…83427-upstream-release-1.14

Manually cherry pick of #83427: Fix double counting issue for request metrics on timeout
k8s-ci-robot added a commit that referenced this pull request Nov 7, 2019
…83427-upstream-release-1.16

Manually cherry pick of #83427: Fix double counting issue for request metrics on timeout
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.