
Dynamic Cardinality Enforcement #1692

Merged
merged 8 commits into from May 19, 2020

Conversation

logicalhan
Member

This KEP proposes a strategy for introducing cardinality constraints on existing metric labels without having to hard-code them manually, so that metric fixes can be decoupled from the Kubernetes release cycle.

/sig instrumentation

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. sig/instrumentation Categorizes an issue or PR as relevant to SIG Instrumentation. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Apr 16, 2020
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: logicalhan

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory label Apr 16, 2020
@k8s-ci-robot k8s-ci-robot added sig/architecture Categorizes an issue or PR as relevant to SIG Architecture. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Apr 16, 2020
@logicalhan
Member Author

/cc @lilic @brancz

This design allows us to optionally adopt @lilic's excellent idea about simplifying the interface for component owners, who can then opt to just specify a metric and label pair *without* having to specify a whitelist. Personally, I like that idea since it simplifies how a component owner can implement our cardinality-enforcing helpers without having to necessarily plumb through complicated maps. It would also make it considerably easier to feed this data in through the command line, since you could do something like this:

```bash
$ kube-apiserver --bind-metric-labels "some_metric=label_too_many_values"
```
Member Author

I'm not terribly crazy about my wording on the flag here.

Member

How about --supported-label-values or --accepted-label-values?


@erain

erain commented Apr 16, 2020

/cc @erain

@k8s-ci-robot
Contributor

@erain: GitHub didn't allow me to request PR reviews from the following users: erain.

Note that only kubernetes members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to this:

/cc @erain

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.


TL;DR: metrics with unbounded label dimensions can cause memory issues in the components they instrument.

The simple solution to this problem is to say "don't do that". We (SIG Instrumentation) have already done so in our instrumentation guidelines, which specifically state that ["one should know a comprehensive list of all possible values for a label at instrumentation time."](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-instrumentation/instrumentation.md#dimensionality--cardinality)
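The guideline quoted above is easiest to see with a toy model: a metric vector grows one time series per distinct label value, so a label whose values are known at instrumentation time bounds memory, while one fed from request data does not. A minimal stdlib-only sketch (no Prometheus client involved):

```go
package main

import "fmt"

// seriesFor counts the distinct label values seen, which is exactly how a
// metric vector grows: one time series per distinct value. This is a toy
// model only; it does not use the real Prometheus client.
func seriesFor(values []string) int {
	seen := map[string]struct{}{}
	for _, v := range values {
		seen[v] = struct{}{}
	}
	return len(seen)
}

func main() {
	// Bounded: HTTP verbs are a known, finite set at instrumentation time.
	fmt.Println(seriesFor([]string{"GET", "POST", "GET", "DELETE"})) // 3

	// Unbounded: caller-controlled values create a new series per request.
	var fromRequests []string
	for i := 0; i < 10000; i++ {
		fromRequests = append(fromRequests, fmt.Sprintf("path-%d", i))
	}
	fmt.Println(seriesFor(fromRequests)) // 10000
}
```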
Contributor

This definitely seems like more of a recommendation than a hard rule. Even in components we own (e.g. kube-state-metrics), there are labels with unbounded cardinality (e.g. pod or namespace). If we really wanted to prevent unbounded cardinality, we would do so at compile-time and do another migration. But I don't think that is actually what we want... We need to be able to use labels with unbounded cardinality in some places, but want an escape hatch for when we mess up.

Member Author

Definitely the node/pod stuff is exceptional. But you guys are doing manual GC for metrics no? Manual cleanup of metrics doesn't really feel to me like the general use-case.

Contributor

Ah, got it. So if we can definitively make the statement that all calls to metrics.NewGaugeVec (for example) should have labels with a specific set of values, should we start enforcing at compile time that all metrics have a whitelist? It would obviously be more intrusive than what's proposed here, but wouldn't it actually allow us to get to 100% bounded-cardinality metrics?

Member Author

Yeah, but emphasis on the intrusive part. I don't know how realistically achievable that would be since it would mean auditing/updating every single metric in the codebase.

Contributor

got it, makes sense.

Member

FWIW, kube-state-metrics is so special it doesn't even use the Prometheus client anymore, but as already mentioned there are other exceptions to the rule, like kubelet_node_name, which is practically bound to one series per node even though the value itself is free-form.

Comment on lines 244 to 245
This design allows us to optionally adopt @lilic's excellent idea about simplifying the interface for component owners, who can then opt to just specify a metric and label pair *without* having to specify a whitelist. Personally, I like that idea since it simplifies how a component owner can implement our cardinality enforcing helpers without having to necessarily plumb through complicated maps. This would make it considerably easier to feed this data in through the command line since you could do something like this:

Member Author

Placeholder comment for the thing that @brancz and @lilic brought up in the meeting today, re: special casing buckets.

Member

Potentially we would want to treat buckets completely separately (as in a separate flag just for bucket configuration of histograms). @bboreham opened the original PR for apiserver request duration bucket reduction, maybe he has some input as well.

My biggest concern with all of this is that it's going to be super easy to end up with extremely customized Kubernetes setups where our existing dashboards and alerting rule definitions just won't apply generally anymore. I'd like to make sure we emphasize that these flags are really only meant to be used as escape hatches, and that we must always strive to truly fix the root of the issue.

Member Author

I will make it clearer that this is intended to be an escape hatch. I can't really imagine many cases where someone would make such drastic changes to their own metrics without realizing it (and hence affecting their alerts and such).

That feels inherently a lot less dangerous since you can't get to that state without some intention from a cluster admin.

@piosz piosz requested review from x13n and serathius April 17, 2020 08:41


Since we already have an interception layer built into our Kubernetes monitoring stack (from the metrics stability effort), we can leverage existing wrappers to provide a global entrypoint for intercepting and enforcing rules for individual metrics.
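As a rough illustration of what such an interception wrapper could do, here is a stdlib-only sketch — the helper name is hypothetical, and the "unexpected" sentinel follows the sample output discussed in this thread, not a confirmed API:

```go
package main

import "fmt"

// boundLabel is a hypothetical sketch of what the interception layer could
// do: return the label value unchanged when it is whitelisted, and the
// sentinel "unexpected" otherwise, so the label can never create more
// series than the whitelist size plus one.
func boundLabel(allowed map[string]struct{}, value string) string {
	if _, ok := allowed[value]; ok {
		return value
	}
	return "unexpected"
}

func main() {
	allowed := map[string]struct{}{"1": {}, "2": {}, "3": {}}
	for _, v := range []string{"1", "2", "999", "attacker-supplied"} {
		fmt.Printf("some_metric{label_too_many_values=%q} 1\n", boundLabel(allowed, v))
	}
}
```

Both "999" and "attacker-supplied" collapse into the single "unexpected" series, which is what bounds memory.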
Member

Since we already have an interception layer, why not simply enforce the cardinality limit there directly?

If we could specify that a certain metric allows up to N unique label sets (or K values per label), an attempt to add anything beyond that would just use some predefined constant instead. This has the downside (compared to your proposal) of being less predictable, because you never know what will get filtered out. However, in practice if K/N are sufficiently high, this shouldn't be a concern (all frequently used label values will show up early, so will be reported). Additionally, the approach I'm suggesting will:

  • enforce that your metrics "just work" instead of putting cluster admins in a position where they have to react to alerts when something is already broken.
  • allow monitoring entities that are of high cardinality, where the interesting label values cannot be listed up front.

If the "not predictable enough" bit is problematic, we can also combine both ideas and treat whitelists as "always working" and enforce explicit cardinality limits only on values that were not whitelisted. This would slightly change the semantics of whitelisting - lack of a whitelist would be equivalent to an empty whitelist.

Member Author

We considered this for the recent security vulnerability. The problem is exactly as you describe: it is 'not predictable enough'. We basically lose deterministic metric output if we adopt this (not to mention metric fidelity). While the remaining metrics will 'work' in the sense that they will be bound to N dimensions and will thus not cause memory leaks, they will not be reliable, since the N dimensions you end up getting will be determined at runtime.

The alternative you mention is problematic from a practical perspective (it is an alternative way of implementing the thing that @dashpole mentioned in an earlier comment). We'd effectively be blanking out all label values for every metric in the Kubernetes codebase until people explicitly laid out a whitelist. That means every single metric would have to be audited, which is quite invasive.

Alternatively, we could use a ratcheting approach, where we enforce that all new metrics must have a whitelist explicitly specified (this could potentially be a requirement of a metric being promoted to the STABLE class).

Member

You can think of my proposal as an extension of yours: instead of getting label_too_many_values right away, a value would still work until a certain label cardinality limit is reached. Whitelisting would guarantee a value will not be dropped, but other values wouldn't be dropped either unless there are too many of them. A cluster admin can configure the per-metric and per-label limits once and get alerted on "some metric labels are dropped" instead of "your metrics storage is getting blown up".
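The combined semantics described here — whitelisted values always kept, other values kept until a per-label limit is reached — could look roughly like this. An illustrative stdlib-only sketch; the type, method names, and sentinel are assumptions, not the KEP's actual API:

```go
package main

import "fmt"

// limiter sketches the combined scheme: whitelisted values are always
// kept, other values are kept until a per-label limit is reached, and
// anything beyond that collapses into a sentinel.
type limiter struct {
	whitelist map[string]struct{}
	limit     int
	seen      map[string]struct{}
}

func (l *limiter) bind(value string) string {
	if _, ok := l.whitelist[value]; ok {
		return value // whitelisted: guaranteed to survive
	}
	if _, ok := l.seen[value]; ok {
		return value // already admitted under the limit
	}
	if len(l.seen) < l.limit {
		l.seen[value] = struct{}{} // admit a new value while under the limit
		return value
	}
	return "unexpected" // over the limit: drop into the sentinel
}

func main() {
	l := &limiter{
		whitelist: map[string]struct{}{"ok": {}},
		limit:     2,
		seen:      map[string]struct{}{},
	}
	for _, v := range []string{"a", "b", "c", "ok"} {
		fmt.Println(l.bind(v)) // a, b, unexpected, ok
	}
}
```

Note the trade-off discussed above: which non-whitelisted values survive ("a" and "b" here) depends purely on arrival order, which is what makes the output non-deterministic.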

Contributor

I think the idea of specifying a cardinality limit instead of an explicit whitelist has a lot of benefits. It allows us to potentially turn any memory leak, which is a DoS security vulnerability, into a degradation of metrics, which is easy to monitor.

What if we adopted a form of both approaches? We could have a global per-label cardinality limit e.g. 1000 values. IMO it shouldn't even be configurable. It should be high enough that we are confident such a label is a "bug". This would bound the number of metric streams at compile time since metrics, labels, and now label values are all bounded. However, we should have some way for operators to "save" their metrics when labels start being messed up by the cardinality limit. This is where this proposal could help. It doesn't allow exceeding the cardinality limit; it just allows specifying which labels are kept.

Member

Having a cardinality limit plus a whitelist is essentially what I'm suggesting. I think we should have both per-label and per-metric limits, though. 1000 values on every label works fine when there is a single problematic label, but without per-metric limits, any metric with multiple labels can still eat a ton of memory.

This would be slightly more complicated to implement correctly than a purely per-label limiting approach, but it would give us a realistic hard memory limit for every metric. If I have 10 labels, each allowing up to 1000 values, the number of possible label combinations is technically bounded, but we can still eat the entire memory on any machine.
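The multiplicative effect is easy to quantify: with per-label limits only, the worst-case series count is the product of each label's cardinality. A small sketch using math/big, since the numbers overflow int64 quickly:

```go
package main

import (
	"fmt"
	"math/big"
)

// combinations returns valuesPerLabel^labels, the worst-case number of
// label-value combinations (series) for a metric when only per-label
// limits are enforced.
func combinations(labels int, valuesPerLabel int64) *big.Int {
	return new(big.Int).Exp(big.NewInt(valuesPerLabel), big.NewInt(int64(labels)), nil)
}

func main() {
	fmt.Println(combinations(2, 1000))  // 1000000: already a lot
	fmt.Println(combinations(10, 1000)) // 10^30: a per-label limit alone is no practical bound
}
```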

Member Author

@x13n's comment is actually why I don't want to have the limit. Something like a 1000-series limit isn't really going to help, since it's the multiplicative effect of the label values that causes cardinality issues.

Also, I am personally wary of setting a limit for something like request_total or request_duration, since they are disproportionately large metrics relative to the norm. We have something like 50 buckets on durations, so I suspect we'd hit 1k timeseries pretty easily/often.

Member Author

Regardless, this is a slightly orthogonal dimension to the KEP. It sounds like people are actually on board with the label whitelist approach, and adopting it will not prevent us from also introducing label limits in the future (though that's not something I am completely convinced of at the moment).

Contributor

Yep, I was thinking that as well. We can definitely consider each (label whitelist or cardinality limit) separately, and I'm in favor of the currently proposed whitelist mechanism.

Member

SGTM, limiting the cardinality can be done (or not) in a separate proposal, to switch from reacting to preventing issues. Whitelisting itself is indeed already giving cluster admins a way of manually mitigating large metric issues on selected metrics.

@logicalhan logicalhan changed the title [WIP] Dynamic Cardinality Enforcement Dynamic Cardinality Enforcement Apr 29, 2020
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 29, 2020
Member

@lilic lilic left a comment


So, tl;dr, if I am understanding it all completely:

  • we would implement a helper to enforce cardinality
  • component owners would enforce cardinality by default for certain metrics
  • a flag would allow users to disable that enforcement for a passed-in allow-list of labels?

@logicalhan
Member Author

  • we would implement a helper to enforce cardinality

Yes

  • component owners would enforce cardinality by default for certain metrics

This one is TBD, but it has been suggested. I think there is a reasonable argument for this as a stability requirement (excluding ad hoc custom collectors for reasons mentioned in comments above).

  • a flag would allow users to disable that enforcement for a passed-in allow-list of labels?

Oh, actually, the flag would enable one to specify a whitelist of allowed labels. By default, alpha metrics would not have any such restrictions.
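To make the flag idea concrete, a component could parse a spec of the form `metric,label=value1;value2` into a whitelist. The grammar below is purely illustrative — the KEP deliberately leaves the exact flag wiring to individual component owners:

```go
package main

import (
	"fmt"
	"strings"
)

// parseAllowList parses one flag entry of the hypothetical form
// "metric_name,label_name=v1;v2;v3" into its key and allowed values. The
// grammar is purely illustrative; the KEP leaves the exact wiring (flags
// vs. config files) to individual component owners.
func parseAllowList(spec string) (key string, values []string, err error) {
	parts := strings.SplitN(spec, "=", 2)
	if len(parts) != 2 || parts[0] == "" || parts[1] == "" {
		return "", nil, fmt.Errorf("expected metric,label=v1;v2, got %q", spec)
	}
	return parts[0], strings.Split(parts[1], ";"), nil
}

func main() {
	key, vals, err := parseAllowList("some_metric,verb=GET;POST;DELETE")
	if err != nil {
		panic(err)
	}
	fmt.Println(key, vals)
}
```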


- @dashpole

> Should have labels with a specific set of values, should we start enforcing that all metrics have a whitelist at compile-time?
Contributor

This can be removed, or merged with @x13n's comment below. As discussed above, we can't enforce that all labels have a whitelist, as some label values are determined at runtime (e.g. node name).

@lilic
Member

lilic commented May 19, 2020

As discussed with @logicalhan, I reworded a few things to help get this merged in by today's KEP deadline. @kubernetes/sig-instrumentation-feature-requests please take a look, thanks!

The main changes:

  • wording
  • clarified that counter-type metrics will not be affected by invalidity, but gauges will be
  • removed open questions as they seem to have been solved in their respective comment threads.

@k8s-ci-robot k8s-ci-robot added the kind/feature Categorizes issue or PR as related to a new feature. label May 19, 2020
…fy a few points

Also removes open questions as those were solved in the discussion.
Member

@brancz brancz left a comment


I'm happy with the mechanism in general. There are a few clarifications and fixes then I'd be happy to merge this.


We will expose the machinery and tools to bind a metric's labels to a discrete set of values.

It is *not a goal* to implement and plumb this solution for each Kubernetes component (there are many SIGs and a number of verticals, which may have their own preferred ways of doing things). As such, it will be up to component owners to leverage the functionality we provide, by feeding configuration data through whatever mechanism is deemed appropriate (e.g. command-line flags or reading from a file).
Member

We should still track which components have adopted the mechanism and advocate for it, no? If we build the mechanism and nobody ends up using it, that kind of defeats the purpose, no?

Member Author

Yes, we should track it. I am personally inclined to implement this for apimachinery, so it will be used.

Member

that works for me

@brancz
Member

brancz commented May 19, 2020

/lgtm
/hold

Thanks a lot @lilic and @logicalhan for pushing this. This lgtm now.

Giving at least one other instrumentation person a chance to review though. Feel free to remove hold when that's done.

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 19, 2020
@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 19, 2020
```
some_metric{label_too_many_values="1"} 1
some_metric{label_too_many_values="2"} 1
some_metric{label_too_many_values="3"} 1
some_metric{label_too_many_values="unexpected"} 1000000
```
Contributor

nit: Should we blacklist this label value across the code base? Or are we OK with collisions with the "unexpected" value?

Member Author

You mean across metrics? My original intent was to target whitelists at a metric-label pair (i.e. the conjunction of 'metric' and 'label_too_many_values').

Contributor

I mean, should we disallow calling WithLabelValues("unexpected")? It doesn't return an error today, so I'm not entirely sure how we would do it. Maybe we should log a warning when someone does that?

Member Author

Won't the default behavior just be the right one?

Contributor

The default behavior would be to lump together label values that are the string literal "unexpected" with label values that were filtered out by the whitelist. Given that "unexpected" is now a special-cased value, I'm suggesting we could/should discourage its use as a normal label value.

Member Author

You'd have to explicitly whitelist a label value called 'unexpected' for that to happen.

Contributor

Agreed. It probably wouldn't ever happen. Someone would have to use the "unexpected" value in a metric label, and then want to whitelist values for that label, including the literal "unexpected"...

@dashpole
Contributor

My comment above is not blocking. This lgtm. Feel free to remove the hold when you are satisfied.

@dashpole
Contributor

/hold cancel

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 19, 2020
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 19, 2020
@dashpole
Contributor

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 19, 2020
@k8s-ci-robot k8s-ci-robot merged commit f13935c into kubernetes:master May 19, 2020
@k8s-ci-robot k8s-ci-robot added this to the v1.19 milestone May 19, 2020
@ehashman
Member

/lgtm

(for completeness)

Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/feature Categorizes issue or PR as related to a new feature. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory lgtm "Looks good to me", indicates that a PR is ready to be merged. sig/architecture Categorizes an issue or PR as relevant to SIG Architecture. sig/instrumentation Categorizes an issue or PR as relevant to SIG Instrumentation. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.

8 participants