Change latency bucket size for API server metrics #67476

Closed

Conversation

mikkeloscar
Contributor

@mikkeloscar mikkeloscar commented Aug 16, 2018

What this PR does / why we need it:

For the apiserver_request_latencies metric, the histogram buckets were defined in the range 125ms to 8s. This causes the metrics to be very skewed if the service is much faster than the 125ms minimum.
The Prometheus client library provides default buckets in the range 5ms to 10s, which is more sensible for a range of different environments.

The default buckets are tailored to broadly measure the response time (in seconds) of a network service.

This changes the bucket sizes for the apiserver_request_latencies metric to the defaults provided by Prometheus and also changes the unit from microseconds to seconds (see the sketch after the list below).

This is reflected by changing the metric names:

  • apiserver_request_latencies -> apiserver_request_latency_seconds
  • apiserver_request_latencies_summary -> apiserver_request_latency_seconds_summary
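
For illustration, a minimal sketch of what the renamed metric could look like when defined with the Prometheus Go client; the label set and registration code are simplified assumptions, not the exact kube-apiserver source:

```go
package main

import "github.com/prometheus/client_golang/prometheus"

// Sketch of the renamed metric using the client library's default buckets
// (5ms up to 10s), with latencies observed in seconds. The label set here
// is illustrative; the real metric carries more labels.
var requestLatencySeconds = prometheus.NewHistogramVec(
	prometheus.HistogramOpts{
		Name: "apiserver_request_latency_seconds",
		Help: "Response latency distribution in seconds for each verb, resource and subresource.",
		// DefBuckets: 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10 (seconds).
		Buckets: prometheus.DefBuckets,
	},
	[]string{"verb", "resource", "subresource"},
)

func main() {
	prometheus.MustRegister(requestLatencySeconds)
}
```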

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #63750

Special notes for your reviewer:

/cc @brancz

Release note:

Change API server latency metrics to use seconds as unit and default Prometheus histogram buckets 

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Aug 16, 2018
@mikkeloscar
Contributor Author

/assign brancz

@wojtek-t
Member

@shyamjvs

@shyamjvs
Member

I'm not an expert, but isn't this a breaking change?

Name: "apiserver_request_latency_seconds",
Help: "Response latency distribution in seconds for each verb, resource and subresource.",
// Use buckets ranging from 5 ms to 10 seconds.
Buckets: prometheus.DefBuckets,
Member

hi @mikkeloscar, thanks for fixing this.

should be microseconds, not seconds. *10^6

ExponentialBuckets(125000, 2.0, 7): [125000 250000 500000 1e+06 2e+06 4e+06 8e+06]
DefBuckets: [0.005 0.01 0.025 0.05 0.1 0.25 0.5 1 2.5 5 10]
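
A small illustrative snippet that reproduces the two bucket sets quoted above using the Prometheus Go client:

```go
package main

import (
	"fmt"

	"github.com/prometheus/client_golang/prometheus"
)

func main() {
	// Old buckets, expressed in microseconds: 125ms doubling up to 8s.
	fmt.Println(prometheus.ExponentialBuckets(125000, 2.0, 7))
	// [125000 250000 500000 1e+06 2e+06 4e+06 8e+06]

	// Client library defaults, expressed in seconds: 5ms up to 10s.
	fmt.Println(prometheus.DefBuckets)
	// [0.005 0.01 0.025 0.05 0.1 0.25 0.5 1 2.5 5 10]
}
```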


Member

oh.. tricked.. 👍🏻

Member

I would add a few more buckets, maybe 25s and 50s.

Member

@ehashman ehashman Oct 12, 2018

Please do; the 8s limit is messing up histogram calculations for verbs like WATCH and CONNECT which tend to have super long latencies. It would probably be valuable to have the largest bucket coincide with the global timeout default (currently 60s iirc).
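
A hypothetical sketch of such an extension, keeping the client defaults and appending a few larger buckets so the top one lines up with the 60s request timeout (the exact values and names are assumptions, not what was eventually merged):

```go
package main

import "github.com/prometheus/client_golang/prometheus"

// Default buckets (5ms..10s) plus larger ones for long-running verbs such as
// WATCH and CONNECT, topping out at the 60s global request timeout.
var extendedBuckets = append(
	append([]float64{}, prometheus.DefBuckets...),
	15, 25, 50, 60,
)

var requestLatencySeconds = prometheus.NewHistogramVec(
	prometheus.HistogramOpts{
		Name:    "apiserver_request_latency_seconds",
		Help:    "Response latency distribution in seconds for each verb, resource and subresource.",
		Buckets: extendedBuckets,
	},
	[]string{"verb", "resource", "subresource"},
)

func main() {
	prometheus.MustRegister(requestLatencySeconds)
}
```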

@yue9944882
Member

@shyamjvs @wojtek-t not sure, but the intention of this pull request makes sense to me. What's more, it will have no impact on performance.

@k8s-ci-robot k8s-ci-robot added lgtm "Looks good to me", indicates that a PR is ready to be merged. and removed lgtm "Looks good to me", indicates that a PR is ready to be merged. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Aug 16, 2018
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: mikkeloscar
To fully approve this pull request, please assign additional approvers.
We suggest the following additional approver: sttts

If they are not already assigned, you can assign the PR to them by writing /assign @sttts in a comment when ready.

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@shyamjvs
Member

Changing the buckets may still be fine (I'm not sure), but renaming metrics does sound like a breaking change. Please hold until this is sorted out.

@lavalamp @smarterclayton

@k8s-ci-robot
Contributor

@mikkeloscar: The following test failed, say /retest to rerun them all:

Test name Commit Details Rerun command
pull-kubernetes-integration 279c7f2afc25d8be6cc6c1a0dd8fae512cb166d2 link /test pull-kubernetes-integration

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@brancz
Member

brancz commented Aug 16, 2018

We don't have any stability guarantees on metrics, and we have broken them in various ways in the past. That doesn't mean we should continue to do so for no reason, but as long as breaking metrics puts them in line with the official instrumentation guidelines, I think a change is for the better.

Changing the buckets already breaks the semantics of the metric and as a user of any system I prefer a hard break, rather than a subtle one.

@smarterclayton
Contributor

smarterclayton commented Aug 16, 2018 via email

@neolit123
Member

/sig api-machinery

@k8s-ci-robot k8s-ci-robot added the sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. label Aug 16, 2018
@brancz
Member

brancz commented Aug 17, 2018

I completely agree that this needs to be widely and clearly communicated.

We have been discussing in sig-instrumentation doing a general overhaul of all the metrics we expose, to at least ensure everything is in line with our instrumentation guidelines. What does everyone think of collecting all the metrics we would like to change and doing it in one go, instead of making one-off improvements? There are numerous violations across the codebase in terms of metric naming, label naming, and non-base units being used; unifying this will give users a better experience working with metrics exposed by Kubernetes.

@smarterclayton
Contributor

smarterclayton commented Aug 17, 2018 via email

@@ -193,10 +193,10 @@ func RecordLongRunning(req *http.Request, requestInfo *request.RequestInfo, fn f
func MonitorRequest(req *http.Request, verb, resource, subresource, scope, contentType string, httpCode, respSize int, elapsed time.Duration) {
reportedVerb := cleanVerb(verb, req)
client := cleanUserAgent(utilnet.GetHTTPClient(req))
elapsedMicroseconds := float64(elapsed / time.Microsecond)
elapsedSeconds := float64(elapsed / time.Second)
Member

Please test this until you find the bug.

Member

@mikkeloscar the underlying type of time.Duration is int64
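
For context, time.Duration is an int64 count of nanoseconds, so elapsed / time.Second is integer division and drops sub-second precision before the float64 conversion. A minimal sketch of the bug and one possible fix (the variable names are illustrative):

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	elapsed := 250 * time.Millisecond

	// Integer division happens first, so anything below one second becomes 0.
	buggy := float64(elapsed / time.Second)

	// Duration.Seconds() converts to float64 before dividing, preserving precision.
	fixed := elapsed.Seconds()

	fmt.Println(buggy, fixed) // 0 0.25
}
```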

Contributor Author

@yue9944882 thanks for the hint, fixed.

@lavalamp
Member

Making a batch update for the metrics changes sounds good, especially if it comes with an easy-to-understand breakdown of the changes.

@mikkeloscar
Contributor Author

So how do we continue here? Do you want to consider this individual change or do you want to make it part of a batch update?

Regarding the rename and breaking backwards compatibility, I totally understand your concerns, and as a user of Kubernetes (or any system really) I'm also in favor of stability. However, the fact that my original issue (#63750) didn't get any comments for 3 months and no one reported a similar issue afaik suggests to me that not many people are using these metrics at this moment, or if they are, they don't know that they are wrong :)
I'm not suggesting this is a good enough reason to break backwards compatibility, but something to keep in mind.

@matthiasr

👋 user of these metrics here – in our case most request latencies are longer than 125ms, so the existing bucketing is useful; that's why we did not see/comment on #63750.

This is not to say that it could or should not be improved, just verifying that these metrics are being used and communication around updating them is necessary. A major renaming is painful but better than many subtle changes in my opinion.

@wenjiaswe
Contributor

@wenjiaswe

@jrake-revelant

Hello folks, what is the status of the review?

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Dec 7, 2018
For the `apiserver_request_latencies` metric, the histogram buckets
defined were in the range 125ms to 8s. This causes the metrics to be
very skewed if the service is much faster than the 125ms minimum.
Prometheus client library provides default buckets in the range 5ms to
10s which is more sensible for a range of different environments.

> The default buckets are tailored to broadly measure the response time
> (in seconds) of a network service.

This changes the bucket sizes for the `apiserver_request_latencies`
metric to the defaults provided by prometheus and also changes the unit
from microseconds to seconds.

This is reflected by changing the metric names:

* `apiserver_request_latencies` -> `apiserver_request_latency_seconds`
* `apiserver_request_latencies_summary` -> `apiserver_request_latency_seconds_summary`

Fix kubernetes#63750

Signed-off-by: Mikkel Oscar Lyderik Larsen <m@moscar.net>
Signed-off-by: Mikkel Oscar Lyderik Larsen <m@moscar.net>
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Dec 7, 2018
@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jan 18, 2019
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Mar 7, 2019
@wojtek-t
Member

This has been fixed in the meantime.

Labels
  • area/apiserver
  • cncf-cla: yes - Indicates the PR's author has signed the CNCF CLA.
  • lifecycle/stale - Denotes an issue or PR has remained open with no activity and has become stale.
  • needs-rebase - Indicates a PR cannot be merged because it has merge conflicts with HEAD.
  • release-note - Denotes a PR that will be considered when it comes time to generate release notes.
  • sig/api-machinery - Categorizes an issue or PR as relevant to SIG API Machinery.
  • size/S - Denotes a PR that changes 10-29 lines, ignoring generated files.
Development

Successfully merging this pull request may close these issues.

Sub-optimal sizing of buckets exposed in the apiserver_request_latencies metrics