Change latency bucket size for API server metrics #67476
Conversation
/assign brancz
I'm not an expert, but isn't this a breaking change?
Name: "apiserver_request_latency_seconds", | ||
Help: "Response latency distribution in seconds for each verb, resource and subresource.", | ||
// Use buckets ranging from 5 ms to 10 seconds. | ||
Buckets: prometheus.DefBuckets, |
hi @mikkeloscar, thanks for fixing this.
This should be microseconds, not seconds (×10^6):
ExponentialBuckets(125000, 2.0, 7): [125000 250000 500000 1e+06 2e+06 4e+06 8e+06]
DefBuckets: [0.005 0.01 0.025 0.05 0.1 0.25 0.5 1 2.5 5 10]
I changed the unit to seconds as suggested by @brancz #63750 (comment)
Change in unit is here: https://github.com/kubernetes/kubernetes/pull/67476/files#diff-c4dd16fa62ccf218ce751334493c3b3eR196
oh.. tricked.. 👍🏻
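For context, a minimal sketch (not part of the PR) that prints the two bucket layouts compared in the review thread above, using the Prometheus client library:

```go
package main

import (
	"fmt"

	"github.com/prometheus/client_golang/prometheus"
)

func main() {
	// Old buckets: microseconds, from 125ms up to 8s.
	fmt.Println(prometheus.ExponentialBuckets(125000, 2.0, 7))
	// [125000 250000 500000 1e+06 2e+06 4e+06 8e+06]

	// New buckets: the Prometheus defaults, in seconds, from 5ms up to 10s.
	fmt.Println(prometheus.DefBuckets)
	// [0.005 0.01 0.025 0.05 0.1 0.25 0.5 1 2.5 5 10]
}
```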
I would add a few more buckets, maybe 25s and 50s.
Please do; the 8s limit is messing up histogram calculations for verbs like WATCH and CONNECT which tend to have super long latencies. It would probably be valuable to have the largest bucket coincide with the global timeout default (currently 60s iirc).
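Not part of this PR, just an illustration of what an extended bucket list could look like if the largest bucket is meant to line up with a 60s timeout; the extra values (25, 50, 60) and the label set shown here are assumptions, not what was merged:

```go
package metrics

import "github.com/prometheus/client_golang/prometheus"

// requestLatencySketch mirrors the metric in this PR but extends the default
// buckets with 25s, 50s, and 60s so the top bucket matches a 60s timeout.
var requestLatencySketch = prometheus.NewHistogramVec(
	prometheus.HistogramOpts{
		Name: "apiserver_request_latency_seconds",
		Help: "Response latency distribution in seconds for each verb, resource and subresource.",
		Buckets: []float64{
			0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10, 25, 50, 60,
		},
	},
	[]string{"verb", "resource", "subresource"},
)
```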
[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: mikkeloscar. If they are not already assigned, you can assign the PR to them by writing a comment. The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files.
Approvers can indicate their approval by writing a comment.
Changing the buckets may still be fine (I'm not sure), but renaming metrics does sound like a breaking change. Please hold until this is sorted out.
@mikkeloscar: The following test failed.
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
We don't have any stability guarantees on metrics, and we have broken them in various ways in the past. That doesn't mean we should continue to do so for no reason, but as long as breaking a metric brings it in line with the official instrumentation guidelines, I think the change is for the better. Changing the buckets already breaks the semantics of the metric, and as a user of any system I prefer a hard break to a subtle one.
We haven't taken a hard line on changing metrics. So far we have treated them under a reduced API guarantee: extra scrutiny, but not a requirement that they never change.

I would probably lean towards it being OK to change, but we should widely communicate the change to the metric as part of the release notes, make sure we are communicating to operators in other mediums to make them aware, and not backport those changes unless necessary.
/sig api-machinery
I completely agree that this needs to be widely and clearly communicated. We have been discussing in sig-instrumentation doing a general overhaul of all the metrics we expose, to at least ensure everything is in line with our instrumentation guidelines. What does everyone think of, instead of doing one-off improvements, collecting all the metrics we would like to change and doing it in one go? There are numerous violations across the codebase, in terms of metric naming, label naming, and non-base units being used; unifying this will give users a better experience working with metrics exposed by Kubernetes.
That seems reasonable if SIG-driven, and the more changes that happen together, the more likely an individual user is to notice that something important happened.
@@ -193,10 +193,10 @@ func RecordLongRunning(req *http.Request, requestInfo *request.RequestInfo, fn f
 func MonitorRequest(req *http.Request, verb, resource, subresource, scope, contentType string, httpCode, respSize int, elapsed time.Duration) {
 	reportedVerb := cleanVerb(verb, req)
 	client := cleanUserAgent(utilnet.GetHTTPClient(req))
-	elapsedMicroseconds := float64(elapsed / time.Microsecond)
+	elapsedSeconds := float64(elapsed / time.Second)
Please test this until you find the bug.
@mikkeloscar the underlying type of time.Duration is int64
@yue9944882 thanks for the hint, fixed.
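For reference, a minimal standalone sketch of the truncation issue pointed out above; the exact fix that ended up in the PR isn't shown in this thread, so the corrected expression here is just one obvious way to do it:

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	elapsed := 1500 * time.Millisecond

	// time.Duration is an int64 count of nanoseconds, so dividing by
	// time.Second before converting truncates to whole seconds.
	buggy := float64(elapsed / time.Second)          // 1
	fixed := float64(elapsed) / float64(time.Second) // 1.5
	fmt.Println(buggy, fixed, elapsed.Seconds())     // 1 1.5 1.5
}
```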
Making a batch update for the metrics changes sounds good, especially if it comes with an easy-to-understand breakdown of the changes.
So how do we continue here? Do you want to consider this individual change, or do you want to make it part of a batch update? Regarding the rename and breaking backwards compatibility, I totally understand your concerns, and as a user of Kubernetes (or any system really) I'm also in favor of stability. However, the fact that my original issue (#63750) didn't get any comments for 3 months and no one reported a similar issue afaik suggests to me that not many people are using these metrics at this moment, or if they are, they don't know that they are wrong :)
👋 user of these metrics here – in our case most request latencies are longer than 125ms, so the existing bucketing is useful; that's why we did not see/comment on #63750. This is not to say that it could or should not be improved, just verifying that these metrics are being used and that communication around updating them is necessary. A major renaming is painful but better than many subtle changes, in my opinion.
Hello folks, what is the status of the review?
For the `apiserver_request_latencies` metric, the histogram buckets defined were in the range 125ms to 8s. This causes the metrics to be very skewed if the service is much faster than the 125ms minimum. The Prometheus client library provides default buckets in the range 5ms to 10s, which is more sensible for a range of different environments.

> The default buckets are tailored to broadly measure the response time
> (in seconds) of a network service.

This changes the bucket sizes for the `apiserver_request_latencies` metric to the defaults provided by Prometheus and also changes the unit from microseconds to seconds. This is reflected by changing the metric names:

* `apiserver_request_latencies` -> `apiserver_request_latency_seconds`
* `apiserver_request_latencies_summary` -> `apiserver_request_latency_seconds_summary`

Fix kubernetes#63750

Signed-off-by: Mikkel Oscar Lyderik Larsen <m@moscar.net>
Force-pushed from b63fa02 to e8b825b.
Issues go stale after 90d of inactivity. If this issue is safe to close now, please do so. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
This has been fixed in the meantime.
What this PR does / why we need it:
For the `apiserver_request_latencies` metric, the histogram buckets defined were in the range 125ms to 8s. This causes the metrics to be very skewed if the service is much faster than the 125ms minimum. The Prometheus client library provides default buckets in the range 5ms to 10s, which is more sensible for a range of different environments.

This changes the bucket sizes for the `apiserver_request_latencies` metric to the defaults provided by Prometheus and also changes the unit from microseconds to seconds. This is reflected by changing the metric names:

* `apiserver_request_latencies` -> `apiserver_request_latency_seconds`
* `apiserver_request_latencies_summary` -> `apiserver_request_latency_seconds_summary`
Which issue(s) this PR fixes (optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close the issue(s) when PR gets merged):

Fixes #63750
Special notes for your reviewer:
/cc @brancz
Release note: