Dynamic Cardinality Enforcement #1692
Conversation
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: logicalhan The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
> This design allows us to optionally adopt @lilic's excellent idea about simplifying the interface for component owners, who can then opt to just specify a metric and label pair *without* having to specify a whitelist. Personally, I like that idea since it simplifies how a component owner can implement our cardinality-enforcing helpers without having to necessarily plumb through complicated maps. This would make it considerably easier to feed this data in through the command line since you could do something like this:
>
> ```bash
> $ kube-apiserver --bind-metric-labels "some_metric=label_too_many_values"
> ```
I'm not terribly crazy about my wording on the flag here.
How about `--supported-label-values` or `--accepted-label-values`?
/cc @erain
@erain: GitHub didn't allow me to request PR reviews from the following users: erain. Note that only kubernetes members and repo collaborators can review this PR, and authors cannot review their own PRs. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
> TL;DR: metrics with unbounded dimensions can cause memory issues in the components they instrument.
>
> The simple solution to this problem is to say "don't do that". We (SIG Instrumentation) have already done so in our instrumentation guidelines, which specifically state that ["one should know a comprehensive list of all possible values for a label at instrumentation time."](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-instrumentation/instrumentation.md#dimensionality--cardinality)
This definitely seems like more of a recommendation than a hard rule. Even in components we own (e.g. kube-state-metrics), there are labels with unbounded cardinality (e.g. `pod` or `namespace`). If we really wanted to prevent unbounded cardinality, we would do so at compile time and do another migration. But I don't think that is actually what we want... We need to be able to use labels with unbounded cardinality in some places, but want an escape hatch for when we mess up.
Definitely the node/pod stuff is exceptional. But you guys are doing manual GC for metrics no? Manual cleanup of metrics doesn't really feel to me like the general use-case.
Ah, got it. So if we can definitively make the statement that all calls to `metrics.NewGaugeVec` (for example) should have labels with a specific set of values, should we start enforcing that all metrics have a whitelist at compile time? It would obviously be more intrusive than what's proposed here, but wouldn't that actually allow us to get to 100% bounded-cardinality metrics?
Yeah, but emphasis on the intrusive part. I don't know how realistically achievable that would be since it would mean auditing/updating every single metric in the codebase.
got it, makes sense.
FWIW kube-state-metrics is so special it doesn't even use the prometheus client anymore, but as already mentioned there are other exceptions to the rule like `kubelet_node_name`, which is practically bound to 1 series per node, but the value is free-form in itself.
Potentially we would want to treat buckets completely separately (as in a separate flag just for bucket configuration of histograms). @bboreham opened the original PR for apiserver request duration bucket reduction, maybe he has some input as well.
My biggest concern with all of this is that it's going to be super easy to end up with extremely customized Kubernetes setups where our existing dashboards and alerting rule definitions just won't apply generally anymore. I'd like to make sure we emphasize that these flags are really only meant to be used as escape hatches, and that we must always strive to truly fix the root of the issue.
I will make it clearer that this is intended to be an escape hatch. I can't really imagine many cases where someone would make such drastic changes to their own metrics though without realizing that they are making drastic changes to their own metrics (hence affecting their alerts and such).
That feels inherently a lot less dangerous since you can't get to that state without some intention from a cluster admin.
> Since we already have an interception layer built into our Kubernetes monitoring stack (from the metrics stability effort), we can leverage existing wrappers to provide a global entrypoint for intercepting and enforcing rules for individual metrics.
Since we already have an interception layer, why not simply enforce the cardinality limit there directly?
If we could specify that a certain metric allows up to N unique label sets (or K values per label), an attempt to add anything beyond that would just use some predefined constant instead. This has the downside (compared to your proposal) of being less predictable, because you never know what will get filtered out. However, in practice if K/N are sufficiently high, this shouldn't be a concern (all frequently used label values will show up early, so will be reported). Additionally, the approach I'm suggesting will:
- enforce that your metrics "just work" instead of putting cluster admins in a position where they have to react based on alerts, when something is already broken.
- allow monitoring entities that are of high cardinality, but when interesting label values cannot be listed up front.
If the "not predictable enough" bit is problematic, we can also combine both ideas and treat whitelists as "always working" and enforce explicit cardinality limits only on values that were not whitelisted. This would slightly change the semantics of whitelisting - lack of a whitelist would be equivalent to an empty whitelist.
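A minimal Go sketch of the limit-with-fallback idea described in this comment, under stated assumptions: the `labelLimiter` helper and the `overflow` constant are invented here and do not exist in the Kubernetes metrics code. Unseen values are admitted until a per-label limit K is reached, after which any further unseen value collapses into one predefined series.

```go
package main

import "fmt"

// overflow is a hypothetical predefined constant used in place of any label
// value that arrives after the per-label limit has been reached.
const overflow = "_overflow_"

type labelLimiter struct {
	limit int
	seen  map[string]bool
}

// admit returns the value to record: known values keep their identity,
// new values are accepted until the limit, and everything else overflows.
func (l *labelLimiter) admit(v string) string {
	if l.seen[v] {
		return v
	}
	if len(l.seen) < l.limit {
		// Frequently used values show up early, so they keep their identity.
		l.seen[v] = true
		return v
	}
	return overflow
}

func main() {
	l := &labelLimiter{limit: 3, seen: map[string]bool{}}
	for _, v := range []string{"a", "b", "a", "c", "d", "e"} {
		// "a", "b", "c" are admitted; "d" and "e" arrive after the limit.
		fmt.Println(l.admit(v))
	}
}
```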
We considered this for the recent security vulnerability. The problem is exactly as you describe: it is 'not predictable enough'. We basically lose deterministic metric output if we adopt this (not to mention metric fidelity). While the remaining metrics will 'work' in the sense that they will be bound to N dimensions and will thus not cause memory leaks, they will not be reliable, since the N dimensions you end up getting will be determined at runtime.
The alternative you mention is problematic from a practical perspective (it is an alternative way of implementing the thing that @dashpole mentioned in an earlier comment). We'd effectively be blanking out all label values for every metric in the Kubernetes codebase until people explicitly laid out a whitelist. That means every single metric would have to be audited, which is quite invasive.
Alternatively, we could use a ratcheting approach, where we enforce that all new metrics must have a whitelist explicitly specified (this could potentially be a requirement of a metric being promoted to the STABLE class).
You can think of my proposal as an extension to yours, such that instead of getting `label_too_many_values` right away, this would still work until a certain label cardinality limit is reached. Whitelisting would guarantee a value will not be dropped, but other values wouldn't be dropped either unless there are too many of them. A cluster admin can configure the per-metric and per-label limits once and can get alerted on "some metric labels are dropped" instead of "your metrics storage is getting blown up".
I think the idea of specifying a cardinality limit instead of explicit whitelist has a lot of benefits. It allows us to potentially turn any memory leak, which is a DOS security vulnerability, into a degradation of metrics, which is easy to monitor.
What if we adopted a form of both approaches? We could have a global per-label cardinality limit e.g. 1000 values. IMO it shouldn't even be configurable. It should be high enough that we are confident such a label is a "bug". This would bound the number of metric streams at compile time since metrics, labels, and now label values are all bounded. However, we should have some way for operators to "save" their metrics when labels start being messed up by the cardinality limit. This is where this proposal could help. It doesn't allow exceeding the cardinality limit; it just allows specifying which labels are kept.
Having a cardinality limit plus a whitelist is essentially what I'm suggesting. I think we should have both per-label and per-metric limits, though. 1000 label values on every label works fine when there is a single problematic label, but without per-metric limits, any metric with multiple labels can still eat a ton of memory.
This would be slightly more complicated to implement correctly than a purely per-label limiting approach, but would give us a realistic hard memory limit for every metric. If I have 10 labels, each of them allowing up to 1000 values, the number of possible label combinations is bounded, but we can still eat the entire memory on any machine.
@x13n's comment is actually why I don't want to have the limit. Something like a 1000-series limit isn't really going to help, since it's the multiplicative effect of the label values which causes cardinality issues.
Also, I am personally wary of setting a limit for something like request_total or request_duration, since they are disproportionately large metrics relative to the norm. We have something like 50 buckets on durations, so I suspect we hit 1k timeseries pretty easily/often.
Regardless, it's a slightly orthogonal dimension to the KEP. It sounds like people are actually on board with the label whitelist approach; adopting it will not prevent us from also doing things like introducing label limits in the future (which is not something I am completely convinced of at the moment).
Yep, I was thinking that as well. We can definitely consider each (label whitelist or cardinality limit) separately, and I'm in favor of the currently proposed whitelist mechanism.
SGTM, limiting the cardinality can be done (or not) in a separate proposal, to switch from reacting to preventing issues. Whitelisting itself is indeed already giving cluster admins a way of manually mitigating large metric issues on selected metrics.
So tl;dr, if I am understanding it all correctly:
- we would implement a helper to enforce cardinality
- component owners would enforce cardinality by default for certain metrics
- a flag would allow users to disable that enforcement for a passed-in allow-list of labels?

Yes

This one is TBD, but it has been suggested. I think there is a reasonable argument for this as a stability requirement (excluding ad hoc custom collectors for reasons mentioned in comments above).

Oh, actually, the flag would enable one to specify a whitelist of allowed labels. By default, alpha metrics would not have any such restrictions.
- @dashpole

> Should have labels with a specific set of values, should we start enforcing that all metrics have a whitelist at compile-time?
This can be removed, or merged with @x13n's comment below. As discussed above, we can't enforce that all labels have a whitelist, as some labels are determined at runtime (e.g. node name).
As discussed with @logicalhan, I reworded a few things to help get this merged in by today's KEP deadline. @kubernetes/sig-instrumentation-feature-requests please take a look, thanks! The main changes:
…fy a few points. Also removes open questions, as those were solved in the discussion.
I'm happy with the mechanism in general. There are a few clarifications and fixes, then I'd be happy to merge this.
> We will expose the machinery and tools to bind a metric's labels to a discrete set of values.
>
> It is *not a goal* to implement and plumb this solution for each Kubernetes component (there are many SIGs and a number of verticals, which may have their own preferred ways of doing things). As such, it will be up to component owners to leverage the functionality that we provide, by feeding configuration data through whatever mechanism is deemed appropriate (e.g. command line flags or reading from a file).
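As one hypothetical illustration of feeding configuration data through a flag, the Go sketch below parses entries of an invented `metric,label=v1;v2` form into a lookup keyed by the (metric, label) pair. Both the `parseAllowList` helper and the syntax are assumptions for the sketch, not the KEP's final format.

```go
package main

import (
	"fmt"
	"strings"
)

// parseAllowList turns "metric,label=v1;v2" entries into a whitelist keyed by
// the (metric, label) pair, matching the per-metric-label scoping discussed.
func parseAllowList(entries []string) map[string]map[string]bool {
	out := map[string]map[string]bool{}
	for _, e := range entries {
		key, vals, ok := strings.Cut(e, "=")
		if !ok {
			continue // ignore malformed entries in this sketch
		}
		set := map[string]bool{}
		for _, v := range strings.Split(vals, ";") {
			set[v] = true
		}
		out[key] = set
	}
	return out
}

func main() {
	// In a real component this slice would come from a repeatable flag or file.
	m := parseAllowList([]string{"some_metric,label_too_many_values=1;2;3"})
	fmt.Println(m["some_metric,label_too_many_values"]["2"])     // whitelisted
	fmt.Println(m["some_metric,label_too_many_values"]["other"]) // not whitelisted
}
```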
We should still track which components have adopted the mechanism and advocate for it, no? If we build the mechanism and nobody ends up using it, then that kind of defeats the purpose, no?
Yes, we should track it. I am personally inclined to implement this for apimachinery, so it will be used.
that works for me
/lgtm Thanks a lot @lilic and @logicalhan for pushing this. This lgtm now. Giving at least one other instrumentation person a chance to review though. Feel free to remove hold when that's done.
```
some_metric{label_too_many_values="1"} 1
some_metric{label_too_many_values="2"} 1
some_metric{label_too_many_values="3"} 1
some_metric{label_too_many_values="unexpected"} 1000000
```
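A small Go sketch of the whitelist semantics implied by the sample output above; the `constrain` helper and `unexpectedValue` constant are names invented here for illustration. Values outside the allow-list for a given metric/label pair are rewritten to `"unexpected"`, so all filtered series collapse into a single bounded series.

```go
package main

import "fmt"

// unexpectedValue is the reserved value that filtered label values map to.
const unexpectedValue = "unexpected"

// constrain returns the label value to record: whitelisted values pass
// through unchanged, everything else collapses to the reserved value.
func constrain(allowed map[string]bool, value string) string {
	if allowed[value] {
		return value
	}
	return unexpectedValue
}

func main() {
	allowed := map[string]bool{"1": true, "2": true, "3": true}
	counts := map[string]int{}
	// A flood of arbitrary values lands in the single "unexpected" series,
	// bounding cardinality regardless of the inputs.
	for i := 0; i < 1000000; i++ {
		counts[constrain(allowed, fmt.Sprint(i))]++
	}
	fmt.Println(len(counts) <= len(allowed)+1) // cardinality is bounded
}
```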
nit: Should we blacklist this label value across the code base? Or are we OK with collisions with the "unexpected" label value?
You mean across metrics? My original intent was to target whitelists for a metric label (so the conjunction of 'metric' and 'label_too_many_values').
I mean, should we disallow calling `WithLabelValues("unexpected")`? It doesn't return an error today, so I'm not entirely sure how we would do it. Maybe we should log a warning when someone does that?
Won't the default behavior just be the right one?
The default behavior would be to lump together label values that are the string literal "unexpected" with label values that were filtered out by the whitelist. Given that "unexpected" is now a special-cased label value, I'm suggesting we could/should discourage its use as a normal label value.
You'd have to explicitly whitelist a label value called 'unexpected' for that to happen.
Agreed. It probably wouldn't ever happen. Someone would have to use the "unexpected" value in a metric label, and then want to whitelist values for that label, including the literal "unexpected"...
My comment above is not blocking. This lgtm. Feel free to remove the hold when you are satisfied.
/hold cancel
/lgtm
/lgtm (for completeness)
This KEP proposes a strategy/feature for introducing cardinality constraints on existing metric labels without manual changes to the code base, so that metric fixes can be decoupled from the Kubernetes release cycle.
/sig instrumentation