-
Notifications
You must be signed in to change notification settings - Fork 38.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Replace url label in rest client latency metrics by host and path #106539
Replace url label in rest client latency metrics by host and path #106539
Conversation
/sig instrumentation |
@@ -140,7 +140,7 @@ type latencyAdapter struct { | |||
} | |||
|
|||
func (l *latencyAdapter) Observe(ctx context.Context, verb string, u url.URL, latency time.Duration) { | |||
l.m.WithContext(ctx).WithLabelValues(verb, u.String()).Observe(latency.Seconds()) | |||
l.m.WithContext(ctx).WithLabelValues(verb, u.Path).Observe(latency.Seconds()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 let's see what CI says
/cc @logicalhan @dashpole I will go and build a kubelet and scrape it to see what the before and after look like. |
/triage accepted |
how did it look? |
Friendly ping @ehashman |
After discussing with @aojea about the use case that we could have for this metric, it seems that both the host and the path are valuable information. Having both under the same |
670723a
to
d06a5dc
Compare
4b64962
to
80a78a7
Compare
80a78a7
to
42ed75c
Compare
Actually, we don't need the level of granularity offered by the path for our downstream use case yet, but I think it might still be useful to upstream. wdyt? Essentially, that would allow setting up finer-grained SLOs. /cc @wojtek-t |
unrelated /test pull-kubernetes-e2e-kind-ipv6 +1 (LGTM) |
/test pull-kubernetes-e2e-kind-ipv6 |
@@ -140,7 +141,7 @@ type latencyAdapter struct { | |||
} | |||
|
|||
func (l *latencyAdapter) Observe(ctx context.Context, verb string, u url.URL, latency time.Duration) { | |||
l.m.WithContext(ctx).WithLabelValues(verb, u.String()).Observe(latency.Seconds()) | |||
l.m.WithContext(ctx).WithLabelValues(verb, u.Host, u.Path).Observe(latency.Seconds()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we have to remember to use the template thing for the path or we'll have an unbounded var here with a potential huge number of values
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah, good point, maybe we could even enforce that here wdyt?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nvm it requires a request object so we won't be able to improve it here, we just have to make sure that we record the templated path
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is going to be a breaking change...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, but considering that the metric is only in ALPHA stage and that the label has an unbound cardinality that has been proven to explode, I think it is fine as long as it is documented in the CHANGELOG
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Stating that things are in ALPHA isn't actually a reason, since that was the default. The criteria for determining whether to be cautious should be (1) is this actually likely to be used by people (2) how badly are we going to break them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given the nature of the issue (i.e. unbounded cardinality), I actually don't disagree with you. I'm just saying we can't make the ALPHA claim as a reason for breaking something.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, I totally agree with that, I was just mentioning it to reinforce my point that cardinality explosion is critical enough to modify the metric even though that would break users which is something that could have been more complex if the metric was stable
I personally don't see that useful anytime soon. |
I believe people consume this metric, so this is a rather dangerous API change since it can break alerts/recording-rules which assume these labels.. Not to say I'm against the change, but I just think we should exercise a little bit of caution around changes like this? |
I will remove the
There surely might be people using these metrics, but I would rather break them and inform them via the changelog rather than keeping metrics with unbound cardinality that have been proven to impact monitoring platforms: prometheus-operator/kube-prometheus#1499 |
42ed75c
to
efda718
Compare
The `rest_client_request_duration_seconds` and `rest_client_rate_limiter_duration_seconds` metrics have a url label that used to contain the whole uri of the request. This is very dangerous and can lead to cardinality explosions since its values aren't bounded. We don't really need to expose the whole uri since these metrics are used to mesure the availability of the different proxy in front the apiserver. The most valuable information is the host to be able to differentiate between the different proxy. In the future, we might also want to add the path to be able to add some granularity, but since there is no immediate use case for that, so there is no need to add it now. Signed-off-by: Damien Grisonnet <dgrisonn@redhat.com>
efda718
to
a188f5b
Compare
/retest |
do we have consensus? |
Seems like a consensus has been reached on this PR, @logicalhan could we move forward with it? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/lgtm
/approve
But I would amend the release note to mention this is a breaking change.
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: dgrisonnet, logicalhan The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
@dgrisonnet @logicalhan We are seeing this issue in 1.22 and wondering if it will be backported or do we have to wait until we upgrade to 1.24? |
Sure, but we will not be able to backport this PR as is since it contains a breaking change. I will backport the patch of the values, but the label name will continue to be |
+1 on backporting. @dgrisonnet - can you elaborate on what you mean by backporting the patch of the values, but the label name will continue? |
Backport of #106539: Replace url label in rest client latency metrics by host and path
Backport of #106539: Replace url label in rest client latency metrics by host and path
What type of PR is this?
/kind bug
What this PR does / why we need it:
The
rest_client_request_duration_seconds
andrest_client_rate_limiter_duration_seconds
metrics have a url labelthat used to contain the whole uri of the request. This is very
dangerous and can lead to cardinality explosions since its values aren't
bounded. We don't really need to expose the whole uri since these
metrics are used to mesure the availability of the different proxy in
front the apiserver. The most valuable information is the host to be
able to differentiate between the different proxy. In the future, we
might also want to add the path to be able to add some granularity, but
since there is no immediate use case for that, so there is no need to
add it now.
Which issue(s) this PR fixes:
Fixes #106538
Special notes for your reviewer:
Does this PR introduce a user-facing change?
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: