Use prometheus conventions for workqueue metrics #71300

Merged
merged 2 commits into kubernetes:master from danielqsj:71165 on Jan 1, 2019

Conversation

@danielqsj
Member

danielqsj commented Nov 21, 2018

What type of PR is this?
/kind feature
/sig api-machinery

What this PR does / why we need it:
Use prometheus conventions for workqueue metrics

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #71165

Special notes for your reviewer:
This patch does not remove the existing metrics but marks them as deprecated.
We need 2 releases for users to convert monitoring configuration.

Does this PR introduce a user-facing change?:

Use prometheus conventions for workqueue metrics.
It is now deprecated to use the following metrics:
* `{WorkQueueName}_depth`
* `{WorkQueueName}_adds`
* `{WorkQueueName}_queue_latency`
* `{WorkQueueName}_work_duration`
* `{WorkQueueName}_unfinished_work_seconds`
* `{WorkQueueName}_longest_running_processor_microseconds`
* `{WorkQueueName}_retries`
Please convert to the following metrics:
* `workqueue_depth`
* `workqueue_adds_total`
* `workqueue_queue_latency_seconds`
* `workqueue_work_duration_seconds`
* `workqueue_unfinished_work_seconds`
* `workqueue_longest_running_processor_seconds`
* `workqueue_retries_total`
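
For anyone converting dashboards or alert rules, the rename list above can be expressed as a small lookup. This is only a sketch: `convertMetricName` is a hypothetical helper, not part of this PR, and the idea that the queue name ends up as a `name` label on the shared metric is an assumption that the release note does not state.

```go
package main

import (
	"fmt"
	"strings"
)

// renames pairs each deprecated per-queue suffix with its replacement metric
// name, copied from the release note above.
var renames = []struct{ oldSuffix, newName string }{
	{"_depth", "workqueue_depth"},
	{"_adds", "workqueue_adds_total"},
	{"_queue_latency", "workqueue_queue_latency_seconds"},
	{"_work_duration", "workqueue_work_duration_seconds"},
	{"_unfinished_work_seconds", "workqueue_unfinished_work_seconds"},
	{"_longest_running_processor_microseconds", "workqueue_longest_running_processor_seconds"},
	{"_retries", "workqueue_retries_total"},
}

// convertMetricName (hypothetical helper) splits a deprecated metric name into
// the new metric name and the queue name that used to be its prefix.
// ok is false when the name matches none of the deprecated suffixes.
func convertMetricName(old string) (newName, queue string, ok bool) {
	for _, r := range renames {
		if strings.HasSuffix(old, r.oldSuffix) && len(old) > len(r.oldSuffix) {
			return r.newName, strings.TrimSuffix(old, r.oldSuffix), true
		}
	}
	return "", "", false
}

func main() {
	newName, queue, ok := convertMetricName("deployment_queue_latency")
	if !ok {
		fmt.Println("not a deprecated workqueue metric")
		return
	}
	// Assumption: the queue name is carried as a label on the shared metric.
	fmt.Printf("%s{name=%q}\n", newName, queue)
}
```

Note that `longest_running_processor` also changes unit (microseconds to seconds), so any threshold on it needs rescaling, not just renaming.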
@danielqsj

Member

danielqsj commented Nov 21, 2018

/assign @mortent

QueueLatencyKey = "queue_latency_microseconds"
WorkDurationKey = "work_duration_microseconds"
UnfinishedWorkKey = "unfinished_work_seconds"
LongestRunningProcessorKey = "longest_running_processor_microseconds"

@mortent

mortent Nov 21, 2018

Member

I think we should stick with one common unit for all the workqueue metrics and not mix seconds and microseconds. Seconds is one of the base units suggested in the prometheus docs (https://prometheus.io/docs/practices/naming/), so I think we can use that unless we have a good reason to use microseconds.

@danielqsj

danielqsj Nov 22, 2018

Member

Agree. Changed unit to seconds. PTAL

@mortent

Member

mortent commented Nov 24, 2018

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm label Nov 24, 2018

@jennybuckley

Contributor

jennybuckley commented Nov 26, 2018

@k8s-ci-robot

Contributor

k8s-ci-robot commented Nov 26, 2018

@jennybuckley: GitHub didn't allow me to request PR reviews from the following users: logicalhan.

Note that only kubernetes members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to this:

/cc @logicalhan

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@mortent

Member

mortent commented Nov 29, 2018

/assign @smarterclayton

@logicalhan
Contributor

logicalhan left a comment

I realize you did not create the files but since you are touching rate_limitting_queue_test.go, would you mind renaming rate_limitting_queue.go and rate_limitting_queue_test.go? Limitting is a typo.

@danielqsj

Member

danielqsj commented Dec 4, 2018

@logicalhan good catch, let's discuss it in #71683 or #71684.

@yue9944882

Member

yue9944882 commented Dec 5, 2018

/remove-sig api-machinery
/sig instrumentation

return adds
}

func (prometheusMetricsProvider) NewLatencyMetric(name string) workqueue.SummaryMetric {

@loburm

loburm Dec 6, 2018

Contributor

We should stop using Summary metrics, please use Histogram instead.

Summary metrics can't be aggregated.
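
To illustrate the aggregation point: a summary exposes quantiles that each instance has already computed, and those cannot be combined meaningfully across instances, whereas histogram bucket counts are plain counters that can be summed before estimating a quantile. The sketch below uses only the standard library. The `histogram` type and both functions are hypothetical illustrations, not the client library's API, and the quantile is reported simply as a bucket upper bound, without interpolation.

```go
package main

import (
	"errors"
	"fmt"
)

// histogram holds cumulative bucket counts: counts[i] is the number of
// observations less than or equal to bounds[i]. For simplicity the last
// bound is assumed to cover every observation (no separate +Inf bucket).
type histogram struct {
	bounds []float64
	counts []uint64
}

// mergeHistograms sums two histograms that share the same bucket layout.
func mergeHistograms(a, b histogram) (histogram, error) {
	if len(a.bounds) != len(b.bounds) ||
		len(a.counts) != len(a.bounds) || len(b.counts) != len(b.bounds) {
		return histogram{}, errors.New("bucket layouts differ")
	}
	out := histogram{bounds: a.bounds, counts: make([]uint64, len(a.counts))}
	for i := range a.bounds {
		if a.bounds[i] != b.bounds[i] {
			return histogram{}, errors.New("bucket layouts differ")
		}
		out.counts[i] = a.counts[i] + b.counts[i]
	}
	return out, nil
}

// quantileUpperBound returns the upper bound of the first bucket whose
// cumulative count reaches the fraction q of all observations.
func quantileUpperBound(h histogram, q float64) float64 {
	if len(h.counts) == 0 || len(h.counts) != len(h.bounds) {
		return 0
	}
	target := q * float64(h.counts[len(h.counts)-1])
	for i, c := range h.counts {
		if float64(c) >= target {
			return h.bounds[i]
		}
	}
	return h.bounds[len(h.bounds)-1]
}

func main() {
	bounds := []float64{0.001, 0.01, 0.1}
	fast := histogram{bounds: bounds, counts: []uint64{90, 99, 100}}
	slow := histogram{bounds: bounds, counts: []uint64{10, 50, 100}}

	merged, err := mergeHistograms(fast, slow)
	if err != nil {
		fmt.Println("merge failed:", err)
		return
	}
	fmt.Println("merged counts:", merged.counts)
	fmt.Println("p90 upper bound (s):", quantileUpperBound(merged, 0.9))
}
```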

@danielqsj

danielqsj Dec 6, 2018

Member

@loburm which buckets do you prefer, or should we leave this out for now?

@loburm

loburm Dec 6, 2018

Contributor

I'm not familiar with the queues here. I remember that the default buckets are, for example, almost useless for kube-apiserver request latency, because most samples fall into the first few buckets, which doesn't give enough information to measure it.

I usually prefer around 20 reasonable buckets. Let's ask someone from sig-instrumentation for advice.

@DirectXMan12 @brancz

@brancz

brancz Dec 6, 2018

Member

As this is on internal queues, the latencies should be rather small, I'd suggest something along the lines of:

prometheus.ExponentialBuckets(10e-9, 10, 10)

That gives us exponential buckets with upper bounds from 10 nanoseconds to 10 seconds.
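
To see which bounds that call produces without pulling in the client library, here is a standard-library sketch that mirrors the documented behaviour of `prometheus.ExponentialBuckets(start, factor, count)`: `count` upper bounds, starting at `start` and multiplying by `factor` each step. The local `exponentialBuckets` function is a stand-in written for illustration, and unlike the library (which panics on invalid arguments) it returns nil for a non-positive count.

```go
package main

import "fmt"

// exponentialBuckets is a local stand-in for prometheus.ExponentialBuckets:
// it returns count upper bounds, the first equal to start and each later one
// equal to the previous bound multiplied by factor.
func exponentialBuckets(start, factor float64, count int) []float64 {
	if count < 1 {
		return nil
	}
	buckets := make([]float64, count)
	for i := range buckets {
		buckets[i] = start
		start *= factor
	}
	return buckets
}

func main() {
	// 10e-9 seconds is 1e-8 seconds, i.e. 10 nanoseconds.
	for _, upper := range exponentialBuckets(10e-9, 10, 10) {
		fmt.Printf("le %g s\n", upper)
	}
}
```

With these arguments the first upper bound is 10 ns and the tenth is roughly 10 s (repeated floating-point multiplication may leave the last digits slightly off), so everything at or below 10 ns lands in the first bucket.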

@loburm

loburm Dec 6, 2018

Contributor

Not sure that's the best approach. I would check current values from a few kube-apiservers and select the range based on them.

@brancz

brancz Dec 6, 2018

Member

If there are histograms for queues in the apiserver, then yes, we should be consistent. Latency histograms for API requests (as in a service that performs network requests) are very different from those for queues, though. Queues should be substantially faster.

@danielqsj

danielqsj Dec 11, 2018

Member

@loburm have we reached a conclusion about the buckets?

@loburm

loburm Dec 11, 2018

Contributor

Do you have any data about the current distribution of those samples? But if you are happy with these buckets:
1ns - 10ns
10ns - 100ns...
1s - 10s
then you can go ahead with that proposal.

@danielqsj

danielqsj Dec 12, 2018

Member

After checking current samples, I agree with your proposal.
Code fixed. PTAL @loburm @brancz

@logicalhan logicalhan referenced this pull request Dec 8, 2018

Closed

REQUEST: New membership for @logicalhan #292

6 of 6 tasks complete

@danielqsj danielqsj force-pushed the danielqsj:71165 branch from 22c3363 to f7f7500 Dec 12, 2018

@loburm

Contributor

loburm commented Dec 12, 2018

/lgtm

@danielqsj danielqsj force-pushed the danielqsj:71165 branch from f7f7500 to 42214c5 Dec 12, 2018

@k8s-ci-robot k8s-ci-robot removed the lgtm label Dec 12, 2018

@danielqsj

Member

danielqsj commented Dec 12, 2018

@loburm @mortent @brancz after formatting the code, this needs a new LGTM, thanks

@k8s-ci-robot k8s-ci-robot added the lgtm label Dec 12, 2018

@loburm

loburm approved these changes Dec 12, 2018

Contributor

loburm left a comment

/lgtm

@brancz

Member

brancz commented Dec 14, 2018

/lgtm
/approve

Could you also open a PR to add this to the metrics overhaul KEP? I just want to make sure we keep everything documented in one place.

@smarterclayton

Contributor

smarterclayton commented Dec 14, 2018

/approve

@k8s-ci-robot

Contributor

k8s-ci-robot commented Dec 14, 2018

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: brancz, danielqsj, smarterclayton

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ash2k

Member

ash2k commented Jan 1, 2019

/test pull-kubernetes-godeps

@k8s-ci-robot k8s-ci-robot merged commit 7284660 into kubernetes:master Jan 1, 2019

19 checks passed

cla/linuxfoundation danielqsj authorized
pull-kubernetes-bazel-build Job succeeded.
pull-kubernetes-bazel-test Job succeeded.
pull-kubernetes-cross Skipped
pull-kubernetes-e2e-gce Job succeeded.
pull-kubernetes-e2e-gce-100-performance Job succeeded.
pull-kubernetes-e2e-gce-device-plugin-gpu Job succeeded.
pull-kubernetes-e2e-gke Skipped
pull-kubernetes-e2e-kops-aws Job succeeded.
pull-kubernetes-e2e-kubeadm-gce Skipped
pull-kubernetes-godeps Job succeeded.
pull-kubernetes-integration Job succeeded.
pull-kubernetes-kubemark-e2e-gce-big Job succeeded.
pull-kubernetes-local-e2e Skipped
pull-kubernetes-local-e2e-containerized Skipped
pull-kubernetes-node-e2e Job succeeded.
pull-kubernetes-typecheck Job succeeded.
pull-kubernetes-verify Job succeeded.
tide In merge pool.

@danielqsj danielqsj deleted the danielqsj:71165 branch Jan 8, 2019

@brancz

Member

brancz commented Jan 10, 2019

@danielqsj can you make sure to create a follow up for this one to add the deprecation notice in the help text for the metrics deprecated in this PR as well? Thanks!

@danielqsj

Member

danielqsj commented Jan 14, 2019

@brancz yes, PR #72679 aims to mark these metrics as deprecated.
