
Replace url label in rest client latency metrics by host and path #106539

Merged 1 commit into kubernetes:master on Feb 17, 2022

Conversation

dgrisonnet
Member

@dgrisonnet dgrisonnet commented Nov 18, 2021

What type of PR is this?

/kind bug

What this PR does / why we need it:

The rest_client_request_duration_seconds and
rest_client_rate_limiter_duration_seconds metrics have a url label
that used to contain the whole URI of the request. This is very
dangerous and can lead to cardinality explosions since its values aren't
bounded. We don't really need to expose the whole URI since these
metrics are used to measure the availability of the different proxies in
front of the apiserver. The most valuable information is the host, to be
able to differentiate between the different proxies. In the future, we
might also want to add the path to add some granularity, but since there
is no immediate use case for that, there is no need to add it now.
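For illustration only (not part of this PR), this is how Go's net/url splits a raw request URL into the bounded host and the potentially unbounded path; the example URL is made up:

```go
package main

import (
	"fmt"
	"net/url"
)

// splitURL extracts the bounded host and the potentially unbounded path
// from a raw request URL. Only the host is kept as a metric label value
// in the change described above.
func splitURL(raw string) (host, path string, err error) {
	u, err := url.Parse(raw)
	if err != nil {
		return "", "", err
	}
	return u.Host, u.Path, nil
}

func main() {
	host, path, _ := splitURL("https://10.0.0.1:443/api/v1/namespaces/kube-system/pods/coredns-abc12")
	fmt.Println(host) // 10.0.0.1:443
	fmt.Println(path) // /api/v1/namespaces/kube-system/pods/coredns-abc12
}
```

The host takes one value per apiserver endpoint, while the path embeds object names and therefore grows without bound.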

Which issue(s) this PR fixes:

Fixes #106538

Special notes for your reviewer:

Does this PR introduce a user-facing change?

Replace the url label of rest_client_request_duration_seconds and rest_client_rate_limiter_duration_seconds metrics with a host label to prevent cardinality explosions and keep only the useful information. This is a breaking change required for security reasons.

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. kind/bug Categorizes issue or PR as related to a bug. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Nov 18, 2021
@dgrisonnet
Member Author

/sig instrumentation
/assign @ehashman

@k8s-ci-robot k8s-ci-robot added sig/instrumentation Categorizes an issue or PR as relevant to SIG Instrumentation. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Nov 18, 2021
@@ -140,7 +140,7 @@ type latencyAdapter struct {
}

func (l *latencyAdapter) Observe(ctx context.Context, verb string, u url.URL, latency time.Duration) {
-	l.m.WithContext(ctx).WithLabelValues(verb, u.String()).Observe(latency.Seconds())
+	l.m.WithContext(ctx).WithLabelValues(verb, u.Path).Observe(latency.Seconds())
Member

👍 let's see what CI says

@ehashman
Member

/cc @logicalhan @dashpole
wdyt?

I will go and build a kubelet and scrape it to see what the before and after look like.

@ehashman
Member

/triage accepted
/priority important-soon

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Nov 19, 2021
@logicalhan
Member

/cc @logicalhan @dashpole wdyt?

I will go and build a kubelet and scrape it to see what the before and after look like.

how did it look?

@dgrisonnet
Member Author

Friendly ping @ehashman

@dgrisonnet
Member Author

After discussing with @aojea the use cases we could have for this metric, it seems that both the host and the path are valuable information. Having both under the same url label makes filtering harder, since it requires some post-processing to select only the path or only the host. As such, I think we should introduce two new labels, host and path, instead of url.

@dgrisonnet dgrisonnet changed the title component-base/metrics: prune url in rest client latency metrics Replace url label in rest client latency metrics by host and path Dec 15, 2021
@dgrisonnet dgrisonnet force-pushed the rest-client-latency branch 2 times, most recently from 4b64962 to 80a78a7 Compare December 15, 2021 11:05
@dgrisonnet dgrisonnet changed the title Replace url label in rest client latency metrics by host Replace url label in rest client latency metrics by host and path Dec 15, 2021
@dgrisonnet
Member Author

Actually, we don't need the level of granularity offered by the path for our downstream use case yet, but I think it might still be useful to upstream. wdyt? Essentially, that would allow setting up finer-grained SLOs.

/cc @wojtek-t

@aojea
Member

aojea commented Dec 15, 2021

Kubernetes e2e suite: [sig-node] Pods Extended Pod Container lifecycle should not create extra sandbox if all containers are done

unrelated

/test pull-kubernetes-e2e-kind-ipv6

+1 (LGTM)

@aojea
Member

aojea commented Dec 16, 2021

/test pull-kubernetes-e2e-kind-ipv6

@@ -140,7 +141,7 @@ type latencyAdapter struct {
}

func (l *latencyAdapter) Observe(ctx context.Context, verb string, u url.URL, latency time.Duration) {
-	l.m.WithContext(ctx).WithLabelValues(verb, u.String()).Observe(latency.Seconds())
+	l.m.WithContext(ctx).WithLabelValues(verb, u.Host, u.Path).Observe(latency.Seconds())
Member

we have to remember to use the template thing for the path or we'll have an unbounded var here with a potential huge number of values
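The "template thing" here refers to recording a templated path such as /api/v1/namespaces/{namespace}/pods/{name} instead of the raw path, so that the label value set stays bounded. A rough sketch of the idea (the templatePath helper and its regexp are hypothetical; real client code would know the route structure rather than guessing):

```go
package main

import (
	"fmt"
	"regexp"
)

// podPath matches raw pod URLs; templatePath collapses their variable
// segments into placeholders so every pod request maps to one label value.
// This is an illustration, not the actual client-go implementation.
var podPath = regexp.MustCompile(`^/api/v1/namespaces/[^/]+/pods/[^/]+$`)

func templatePath(path string) string {
	if podPath.MatchString(path) {
		return "/api/v1/namespaces/{namespace}/pods/{name}"
	}
	return path
}

func main() {
	// Both raw paths collapse to the same bounded label value.
	fmt.Println(templatePath("/api/v1/namespaces/default/pods/nginx-1"))
	fmt.Println(templatePath("/api/v1/namespaces/default/pods/nginx-2"))
}
```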

Member Author

ah, good point, maybe we could even enforce that here wdyt?

Member Author

nvm it requires a request object so we won't be able to improve it here, we just have to make sure that we record the templated path

Member

this is going to be a breaking change...

Member Author

yes, but considering that the metric is only in ALPHA stage and that the label has unbounded cardinality that has been proven to explode, I think it is fine as long as it is documented in the CHANGELOG

Member

Stating that things are in ALPHA isn't actually a reason, since that was the default. The criteria for determining whether to be cautious should be (1) is this actually likely to be used by people (2) how badly are we going to break them.

Member

Given the nature of the issue (i.e. unbounded cardinality), I actually don't disagree with you. I'm just saying we can't make the ALPHA claim as a reason for breaking something.

Member Author

yes, I totally agree with that. I was just mentioning it to reinforce my point that cardinality explosion is critical enough to modify the metric even though that would break users, which would have been more complicated if the metric were stable

@wojtek-t
Member

wojtek-t commented Jan 3, 2022

Actually, we don't need the level of granularity offered by the path for our downstream use case yet, but I think it might still be useful to upstream. wdyt? Essentially, that would allow setting up finer-grained SLOs.

/cc @wojtek-t

I personally don't see that useful anytime soon.
That said, if we're not afraid about the effect of unbounded cardinality (I didn't look carefully at the PR), I'm not against either.

@logicalhan
Member

I believe people consume this metric, so this is a rather dangerous API change since it can break alerts/recording-rules which assume these labels.

Not to say I'm against the change, but I just think we should exercise a little bit of caution around changes like this?

@dgrisonnet
Member Author

I will remove the path label for now since it doesn't seem to be useful and I don't want us to forget about it in the future if it is never used.

I believe people consume this metric, so this is a rather dangerous API change since it can break alerts/recording-rules which assume these labels..

There surely might be people using these metrics, but I would rather break them and inform them via the changelog rather than keep metrics with unbounded cardinality that have been proven to impact monitoring platforms: prometheus-operator/kube-prometheus#1499
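A toy sketch (hypothetical, not Kubernetes code) of why an unbounded url label explodes the series count while a host label stays flat:

```go
package main

import "fmt"

// countSeries simulates scraping n requests against one apiserver:
// each unique pod name mints a new series under a url label, while a
// host label yields a single series regardless of n.
func countSeries(n int) (urlSeries, hostSeries int) {
	urls := map[string]struct{}{}
	hosts := map[string]struct{}{}
	for i := 0; i < n; i++ {
		u := fmt.Sprintf("https://10.0.0.1:443/api/v1/namespaces/default/pods/pod-%d", i)
		urls[u] = struct{}{}
		hosts["10.0.0.1:443"] = struct{}{}
	}
	return len(urls), len(hosts)
}

func main() {
	u, h := countSeries(1000)
	fmt.Println(u, h) // 1000 1
}
```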

The `rest_client_request_duration_seconds` and
`rest_client_rate_limiter_duration_seconds` metrics have a url label
that used to contain the whole URI of the request. This is very
dangerous and can lead to cardinality explosions since its values aren't
bounded. We don't really need to expose the whole URI since these
metrics are used to measure the availability of the different proxies in
front of the apiserver. The most valuable information is the host, to be
able to differentiate between the different proxies. In the future, we
might also want to add the path to add some granularity, but since there
is no immediate use case for that, there is no need to add it now.

Signed-off-by: Damien Grisonnet <dgrisonn@redhat.com>
@dgrisonnet
Member Author

/retest

@aojea
Member

aojea commented Jan 12, 2022

do we have consensus?

@dgrisonnet
Member Author

Seems like a consensus has been reached on this PR, @logicalhan could we move forward with it?

@logicalhan (Member) left a comment

/lgtm
/approve

But I would amend the release note to mention this is a breaking change.

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 17, 2022
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dgrisonnet, logicalhan

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 17, 2022
@k8s-ci-robot k8s-ci-robot merged commit 6de9ddd into kubernetes:master Feb 17, 2022
@k8s-ci-robot k8s-ci-robot added this to the v1.24 milestone Feb 17, 2022
@mitchellmaler

@dgrisonnet @logicalhan We are seeing this issue in 1.22 and wondering if it will be backported or do we have to wait until we upgrade to 1.24?

@dgrisonnet
Member Author

Sure, but we will not be able to backport this PR as is since it contains a breaking change. I will backport the patch of the values, but the label name will continue to be url in 1.22 and 1.23.

@ruiwen-zhao
Contributor

+1 on backporting.

@dgrisonnet - can you elaborate on what you mean by backporting the patch of the values, but the label name will continue?

k8s-ci-robot added a commit that referenced this pull request May 23, 2022
Backport of #106539: Replace url label in rest client latency metrics by host and path
k8s-ci-robot added a commit that referenced this pull request May 23, 2022
Backport of #106539: Replace url label in rest client latency metrics by host and path
Successfully merging this pull request may close these issues.

Cardinality explosions in kubelet rest client latency metrics