Change latency bucket size for API server metrics #67476

Closed

Conversation

mikkeloscar
Contributor

@mikkeloscar mikkeloscar commented Aug 16, 2018

What this PR does / why we need it:

For the apiserver_request_latencies metric, the histogram buckets were defined in the range 125ms to 8s. This causes the metrics to be very skewed if the service is much faster than the 125ms minimum.
The Prometheus client library provides default buckets in the range 5ms to 10s, which is more sensible for a range of different environments.

The default buckets are tailored to broadly measure the response time (in seconds) of a network service.

This changes the bucket sizes for the apiserver_request_latencies metric to the defaults provided by Prometheus and also changes the unit from microseconds to seconds (see the sketch after the list below).

This is reflected by changing the metric names:

  • apiserver_request_latencies -> apiserver_request_latency_seconds
  • apiserver_request_latencies_summary -> apiserver_request_latency_seconds_summary
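
For illustration, a minimal sketch of what the renamed metric could look like when defined with the Prometheus Go client; the label set and registration code are simplified assumptions, not the exact kube-apiserver source:

```go
package main

import "github.com/prometheus/client_golang/prometheus"

// Sketch of the renamed metric using the client library's default buckets
// (5ms up to 10s), with latencies observed in seconds. The label set here
// is illustrative; the real metric carries more labels.
var requestLatencySeconds = prometheus.NewHistogramVec(
	prometheus.HistogramOpts{
		Name: "apiserver_request_latency_seconds",
		Help: "Response latency distribution in seconds for each verb, resource and subresource.",
		// DefBuckets: 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10 (seconds).
		Buckets: prometheus.DefBuckets,
	},
	[]string{"verb", "resource", "subresource"},
)

func main() {
	prometheus.MustRegister(requestLatencySeconds)
}
```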

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #63750

Special notes for your reviewer:

/cc @brancz

Release note:

Change API server latency metrics to use seconds as unit and default Prometheus histogram buckets 

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Aug 16, 2018
@mikkeloscar
Contributor Author

/assign brancz

@wojtek-t
Member

@shyamjvs

@shyamjvs
Member

I'm not an expert, but isn't this a breaking change?

Name: "apiserver_request_latency_seconds",
Help: "Response latency distribution in seconds for each verb, resource and subresource.",
// Use buckets ranging from 5 ms to 10 seconds.
Buckets: prometheus.DefBuckets,
Member

hi @mikkeloscar, thanks for fixing this.

should be microseconds, not seconds. *10^6

ExponentialBuckets(125000, 2.0, 7): [125000 250000 500000 1e+06 2e+06 4e+06 8e+06]
DefBuckets: [0.005 0.01 0.025 0.05 0.1 0.25 0.5 1 2.5 5 10]
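
A small illustrative snippet that reproduces the two bucket sets quoted above using the Prometheus Go client:

```go
package main

import (
	"fmt"

	"github.com/prometheus/client_golang/prometheus"
)

func main() {
	// Old buckets, expressed in microseconds: 125ms doubling up to 8s.
	fmt.Println(prometheus.ExponentialBuckets(125000, 2.0, 7))
	// [125000 250000 500000 1e+06 2e+06 4e+06 8e+06]

	// Client library defaults, expressed in seconds: 5ms up to 10s.
	fmt.Println(prometheus.DefBuckets)
	// [0.005 0.01 0.025 0.05 0.1 0.25 0.5 1 2.5 5 10]
}
```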


Member

oh.. tricked.. 👍🏻

Member

I would add a few more buckets, maybe 25s and 50s.

Member

@ehashman ehashman Oct 12, 2018

Please do; the 8s limit is messing up histogram calculations for verbs like WATCH and CONNECT which tend to have super long latencies. It would probably be valuable to have the largest bucket coincide with the global timeout default (currently 60s iirc).
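
A hypothetical sketch of such an extension, keeping the client defaults and appending a few larger buckets so the top one lines up with the 60s request timeout (the exact values and names are assumptions, not what was eventually merged):

```go
package main

import "github.com/prometheus/client_golang/prometheus"

// Default buckets (5ms..10s) plus larger ones for long-running verbs such as
// WATCH and CONNECT, topping out at the 60s global request timeout.
var extendedBuckets = append(
	append([]float64{}, prometheus.DefBuckets...),
	15, 25, 50, 60,
)

var requestLatencySeconds = prometheus.NewHistogramVec(
	prometheus.HistogramOpts{
		Name:    "apiserver_request_latency_seconds",
		Help:    "Response latency distribution in seconds for each verb, resource and subresource.",
		Buckets: extendedBuckets,
	},
	[]string{"verb", "resource", "subresource"},
)

func main() {
	prometheus.MustRegister(requestLatencySeconds)
}
```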

@yue9944882
Member

@shyamjvs @wojtek-t not sure, but the intention of this pull request makes sense to me. What's more, it will have no impact on performance.

@k8s-ci-robot k8s-ci-robot added lgtm "Looks good to me", indicates that a PR is ready to be merged. and removed lgtm "Looks good to me", indicates that a PR is ready to be merged. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Aug 16, 2018
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: mikkeloscar
To fully approve this pull request, please assign additional approvers.
We suggest the following additional approver: sttts

If they are not already assigned, you can assign the PR to them by writing /assign @sttts in a comment when ready.

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@shyamjvs
Member

Changing the buckets may still be fine (I'm not sure), but renaming metrics does sound like a breaking change. Please hold until this is sorted out.

@lavalamp @smarterclayton

@k8s-ci-robot
Contributor

@mikkeloscar: The following test failed, say /retest to rerun them all:

Test name Commit Details Rerun command
pull-kubernetes-integration 279c7f2afc25d8be6cc6c1a0dd8fae512cb166d2 link /test pull-kubernetes-integration

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@brancz
Member

brancz commented Aug 16, 2018

We don't have any stability guarantees on metrics, and we have broken them in various ways in the past. That doesn't mean we should continue to do so for no reason, but as long as breaking metrics puts them in line with the official instrumentation guidelines, I think a change is for the better.

Changing the buckets already breaks the semantics of the metric and as a user of any system I prefer a hard break, rather than a subtle one.

@smarterclayton
Contributor

smarterclayton commented Aug 16, 2018 via email

@neolit123
Member

/sig api-machinery

@k8s-ci-robot k8s-ci-robot added the sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. label Aug 16, 2018
@brancz
Member

brancz commented Aug 17, 2018

I completely agree that this needs to be widely and clearly communicated.

We have been discussing in sig-instrumentation doing a general overhaul of all the metrics we expose, to at least ensure everything is in line with our instrumentation guidelines. What does everyone think of collecting all the metrics we would like to change and doing it in one go, instead of making one-off improvements? There are numerous violations across the codebase in terms of metric naming, label naming, and non-base units being used; unifying this will give users a better experience working with metrics exposed by Kubernetes.

@smarterclayton
Contributor

smarterclayton commented Aug 17, 2018 via email

@@ -193,10 +193,10 @@ func RecordLongRunning(req *http.Request, requestInfo *request.RequestInfo, fn f
func MonitorRequest(req *http.Request, verb, resource, subresource, scope, contentType string, httpCode, respSize int, elapsed time.Duration) {
reportedVerb := cleanVerb(verb, req)
client := cleanUserAgent(utilnet.GetHTTPClient(req))
elapsedMicroseconds := float64(elapsed / time.Microsecond)
elapsedSeconds := float64(elapsed / time.Second)
Member

Please test this until you find the bug.

Member

@mikkeloscar the underlying type of time.Duration is int64
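
For context, time.Duration is an int64 count of nanoseconds, so elapsed / time.Second is integer division and drops sub-second precision before the float64 conversion. A minimal sketch of the bug and one possible fix (the variable names are illustrative):

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	elapsed := 250 * time.Millisecond

	// Integer division happens first, so anything below one second becomes 0.
	buggy := float64(elapsed / time.Second)

	// Duration.Seconds() converts to float64 before dividing, preserving precision.
	fixed := elapsed.Seconds()

	fmt.Println(buggy, fixed) // 0 0.25
}
```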

Contributor Author

@yue9944882 thanks for the hint, fixed.

@lavalamp
Member

Making a batch update for the metrics changes sounds good, especially if it comes with an easy-to-understand breakdown of the changes.

@mikkeloscar
Contributor Author

So how do we continue here? Do you want to consider this individual change or do you want to make it part of a batch update?

Regarding the rename and breaking backwards compatibility, I totally understand your concerns, and as a user of Kubernetes (or any system really) I'm also in favor of stability. However, the fact that my original issue (#63750) didn't get any comments for 3 months and no one reported a similar issue afaik suggests to me that not many people are using these metrics at this moment, or if they are, they don't know that they are wrong :)
I'm not suggesting this is a good enough reason to break backwards compatibility, but something to keep in mind.

@matthiasr

👋 user of these metrics here – in our case most request latencies are longer than 125ms, so the existing bucketing is useful; that's why we did not see/comment on #63750.

This is not to say that it could or should not be improved, just verifying that these metrics are being used and communication around updating them is necessary. A major renaming is painful but better than many subtle changes in my opinion.

@wenjiaswe
Contributor

@wenjiaswe

@jrake-revelant

Hello folks, what is the status of the review?

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Dec 7, 2018
For the `apiserver_request_latencies` metric, the histogram buckets
defined were in the range 125ms to 8s. This causes the metrics to be
very skewed if the service is much faster than the 125ms minimum.
Prometheus client library provides default buckets in the range 5ms to
10s which is more sensible for a range of different environments.

> The default buckets are tailored to broadly measure the response time
> (in seconds) of a network service.

This changes the bucket sizes for the `apiserver_request_latencies`
metric to the defaults provided by prometheus and also changes the unit
from microseconds to seconds.

This is reflected by changing the metric names:

* `apiserver_request_latencies` -> `apiserver_request_latency_seconds`
* `apiserver_request_latencies_summary` -> `apiserver_request_latency_seconds_summary`

Fix kubernetes#63750

Signed-off-by: Mikkel Oscar Lyderik Larsen <m@moscar.net>
Signed-off-by: Mikkel Oscar Lyderik Larsen <m@moscar.net>
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Dec 7, 2018
@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jan 18, 2019
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Mar 7, 2019
@wojtek-t
Member

This has been fixed in the meantime.

Labels
  • area/apiserver
  • cncf-cla: yes - Indicates the PR's author has signed the CNCF CLA.
  • lifecycle/stale - Denotes an issue or PR has remained open with no activity and has become stale.
  • needs-rebase - Indicates a PR cannot be merged because it has merge conflicts with HEAD.
  • release-note - Denotes a PR that will be considered when it comes time to generate release notes.
  • sig/api-machinery - Categorizes an issue or PR as relevant to SIG API Machinery.
  • size/S - Denotes a PR that changes 10-29 lines, ignoring generated files.
Development

Successfully merging this pull request may close these issues.

Sub-optimal sizing of buckets exposed in the apiserver_request_latencies metrics