
Instrumenting "time series examined" for queries and rules #4374

Open · matthiasr opened this issue Jul 12, 2018 · 17 comments · 3 participants
matthiasr commented Jul 12, 2018

Proposal

As a user, I often wish to know the number of time series that were considered for a query, even if they are later discarded by label matching or operators.

The number should be the straight-up sum of the number of time series matched by all time series selectors in the expression, in the time frame of the query.
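
For illustration (the numbers are made up): the expression

sum(rate(http_requests_total[5m])) / sum(up{job="web"})

contains two selectors, http_requests_total[5m] and up{job="web"}. If the first matches 120 series within the query window and the second matches 40, the proposed count would be 120 + 40 = 160, regardless of how many series survive the division.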

I believe passing up this information is relevant to #3922 (comment) as well, but in addition I would like to see it exposed. For queries through the API, this can be returned with the query result (or with the error, if the query is aborted). For alert and recording rules, I would like to have this information in a per-rule metric, broken out by the rule group and type, and by the alert name or the "record" string respectively (or an equivalent identifier).

Use case. Why is this important?

The main use case, for me, is to catch rules that accidentally do not match any time series in the first place. For example, consider the alert expression rate(http_errors_total[1m]) > 0. In steady-state operation, there would always be some time series with the name http_errors_total, so the number of time series matched should never be zero.

Currently, if for some reason nothing matches, there will be no alert even if the world is on fire. This can happen for a number of reasons: a typo in the rule (which no syntax checker can catch), an application change that renamed the metric, the application being scaled down accidentally, or a wrong scrape configuration (potentially one that does not cause scrape errors).

As a Prometheus operator, I would like to be able to formulate an alert that notifies me if there are any (unexpected) rules that do not match anything in the first place. In my experience, only a small number of alert rules would need to be excluded from this, and the labelling outlined above would make that possible. This alert would neatly cover all the various reasons an alert or recording rule has become ineffective, without having to alert on each possible cause individually.
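
As a sketch of what such a meta-alert could look like, assuming the proposed count were exposed as a per-rule-definition time series under a hypothetical name like prometheus_rule_evaluation_series_examined (no such series exists today):

groups:
  - name: meta
    rules:
      - alert: RuleMatchesNoSeries
        # prometheus_rule_evaluation_series_examined is hypothetical; it stands
        # in for whatever per-rule identifier the proposal would expose.
        expr: prometheus_rule_evaluation_series_examined == 0
        for: 15m
        labels:
          severity: warning

Known-sparse rules could then be excluded with a label matcher on the rule group or name.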

I don't have a direct use case for the information in ad-hoc queries, although I can see it being useful to understand why a query may be expensive.

Alternative

A similar effect could be achieved by pairing each rule with another rule that checks absent(…) on every single time series matcher. However, this is error-prone (it won't catch typos, and the paired rules have to be kept in sync by hand), easy to forget, more computationally expensive (everything needs to be matched again), and verbose. If we instrument PromQL to collect this information anyway, it would be a great additional benefit to expose it in this way.
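
For the example above, such a pairing would look roughly like this:

  - alert: HighErrorRate
    expr: rate(http_errors_total[1m]) > 0
  - alert: HighErrorRateSelectorEmpty
    # Must be updated by hand whenever the expression above changes;
    # a typo here silently defeats the safety net.
    expr: absent(http_errors_total)

Every selector in every rule would need such a shadow rule, which is exactly the duplication this proposal avoids.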

matthiasr commented Jul 12, 2018

@SuperQ, you expressed interest in this as well.

brian-brazil commented Jul 12, 2018

in a per-rule metric,

We explicitly don't have per-rule metrics, as they'd be too high cardinality. This is debug and performance data which we could expose in some form in the UI. It doesn't belong on /metrics, as it's not information about the performance of the Prometheus server; it sounds like a linter of some form is what you want.

matthiasr commented Jul 13, 2018

How would a linter know what metrics and label values are being exposed by applications?

matthiasr commented Jul 13, 2018

The ALERTS metric is also per-rule? I would want even less granularity than that: just one metric per rule definition (sorry, that wasn't formulated clearly).

brian-brazil commented Jul 13, 2018

How would a linter know what metrics and label values are being exposed by applications?

It could run the queries against a live Prometheus; I've seen such things developed before for this purpose.
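
For illustration, a minimal check of that kind could query the HTTP API for each rule expression and flag empty results (assuming a server at localhost:9090):

$ curl -s 'http://localhost:9090/api/v1/query?query=rate(http_errors_total%5B1m%5D)' \
    | jq '.data.result | length'
0

A result length of 0 means the selector currently matches nothing, so the rule can never fire.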

The ALERTS metric is also per-rule?

It's per alert, but it's not on /metrics. I don't think this sort of profiling information belongs anywhere beyond the API.

matthiasr commented Jul 13, 2018

Ooooh, sorry, now I understand that I didn't formulate that cleanly. I never intended this to be on /metrics; for the use case outlined, that's not really necessary. Just recording it as a time series or something would be fine, but I do want to formulate alert rules over this.

The linter would only catch cases where the rule is off target at the time the linter is run, but not if the ground has shifted under it.

brian-brazil commented Jul 13, 2018

I think this is a use case for an exporter, not anything inside Prometheus that is automatically added to the tsdb. We're going to have quite a few data points about query execution in future (tens to hundreds, depending on the query), and I don't think it's appropriate to bloat the tsdb with those for every rule a user has.

In addition, there is no sane identifier for a rule, as rules are permitted to have duplicate names.

This is best solved outside Prometheus itself.

matthiasr commented Jul 13, 2018

Fair point; we can work out the exact implementation of this then. The main point for me was that I would like to have this particular piece of data collected and available in the first place.

matthiasr commented Jul 13, 2018

It would be kind of weird to have a prometheus_exporter, though. Maybe this is something we could build into Prometheus as a separate on-demand metrics endpoint, possibly with selectable collectors/modules?

brian-brazil commented Jul 13, 2018

My current idea is to include this sort of stats with the existing rule/query duration stats on the Rule status page (and thus the API when that gets in), and beyond that offer per-PromQL-node stats in some sane way too.
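
A hypothetical sketch of how that could look in a rules API payload, once such an API exists (all field names invented for illustration):

{
  "name": "HighErrorRate",
  "query": "rate(http_errors_total[1m]) > 0",
  "evaluationDuration": "2ms",
  "seriesExamined": 1023
}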

roidelapluie commented Jul 18, 2018

What about a "promtool explain" command that would run a query against a specific API endpoint and return all that info?

roidelapluie commented Jul 18, 2018

e.g.

$ promtool explain 'up unless on(job,instance) rate(my_great_metric[5m])'
Query took: 2s
> up: 1s / 1023 instant timeseries
> rate(my_great_metric[5m])
  > rate: 1s
  > my_great_metric[5m]: 233 timeseries x 10 datapoints
matthiasr commented Jul 18, 2018

That would be useful while developing or checking expressions, but since rules are evaluated constantly anyway, I'd rather have access to the statistics of those evaluations in some way.

roidelapluie commented Jul 18, 2018

Rules are just the tip of the iceberg.

brian-brazil commented Jul 18, 2018

Once these stats are there I'd expect them to end up in the UI (for rules), and in the APIs (for rules and query/query_range).

roidelapluie commented Jul 18, 2018

exp

brian-brazil commented Jul 18, 2018

That's yet another feature again, which is not what we're talking about here (but that's also on the cards). With the current browser expression UI these would be additional stats in the top right corner, alongside the series returned and the duration.
