Instrumenting "time series examined" for queries and rules #4374
Comments
matthiasr added the kind/enhancement and component/promql labels on Jul 12, 2018
@SuperQ you expressed interest in this as well
We explicitly don't have per-rule metrics, as they'd be too high cardinality. This is debug and performance data which we could expose in some form in the UI. It doesn't belong on /metrics, as it's not information about the performance of the Prometheus server; it sounds like a linter of some form is what you want.
brian-brazil added the priority/Pmaybe label on Jul 12, 2018
How would a linter know what metrics and label values are being exposed by applications?
The ALERTS metric is also per-rule? I would want even less granularity than that, just one metric per rule definition (sorry, that wasn't formulated clearly).
It could run the queries against a live Prometheus; I've seen such things developed before for this purpose.
It's per alert, but it's not on /metrics. I don't think this sort of profiling information belongs anywhere beyond the API.
Ooooh sorry, now I understand that I didn't formulate that cleanly. I never intended this to be on /metrics. The linter would only catch cases where the rule is off target at the time the linter is run, but not if the ground has shifted under it.
I think this is a use case for an exporter, not anything inside Prometheus that is automatically added to the tsdb. We're going to have quite a few data points about query execution in future (tens to hundreds, depending on the query), and I don't think it's appropriate to bloat the tsdb with those for every rule a user has. In addition, there is no sane identifier for a rule, as they are permitted to have duplicate names. This is best solved outside Prometheus itself.
Fair point; we can work out the exact implementation of this then. The main point for me was that I would like to have this particular piece of data collected and available in the first place.
It would be kind of weird to have a […]
My current idea is to include these sorts of stats with the existing rule/query duration stats on the Rule status page (and thus the API when that gets in), and beyond that to offer per-PromQL-node stats in some sane way too.
What about a «promtool explain» command that would run a query against a specific API endpoint and return all that info?
e.g. […]
That would be useful while developing / checking expressions, but since rules are being evaluated constantly anyway I'd rather have access to the statistics of that in some way.
Rules are just the tip of the iceberg.
Once these stats are there, I'd expect them to end up in the UI (for rules), and in the APIs (for rules and query/query_range).
That's yet another feature again, which is not what we're talking about here (but it's also on the cards). With the current browser expression UI, these would be additional stats in the top right corner, alongside the series returned and duration.

matthiasr commented on Jul 12, 2018
Proposal
As a user, I often wish to know the number of time series that were considered for a query, even if they are later discarded by label matching or operators.
The number should be the straight-up sum of the number of time series matched by all time series selectors in the expression, in the time frame of the query.
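As a hypothetical illustration of that counting rule (the series counts below are invented), take an expression containing two time series selectors:

    # Two selectors: http_requests_total{job="api"} and http_requests_total.
    sum(rate(http_requests_total{job="api"}[5m]))
      / sum(rate(http_requests_total[5m]))
    # If the first selector matches 40 series and the second 120 within the
    # query window, the "time series examined" count would be 40 + 120 = 160,
    # even though the final result is a single series.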
I believe passing up this information is relevant to #3922 (comment) as well, but in addition I would like to see it exposed. For queries through the API, this can be returned with the query result (or the error, if the query is aborted). For alert and recording rules, I would like to have this information in a per-rule metric, broken out by the rule group and type, the alert name or the "record" string respectively (or an equivalent identifier).
Use case. Why is this important?
The main use case, for me, is to catch rules that accidentally do not match any time series in the first place. For example, consider the alert expression rate(http_errors_total[1m]) > 0. In steady-state operation, there would always be some time series with the name http_errors_total, so the number of time series matched should never be zero. Currently, if for some reason nothing matches, then there will be no alert even if the world is on fire. This can happen for a number of reasons – a typo in the rule (which no syntax checker can catch), the application may have changed and now emits a different name, the application may have been scaled down accidentally, or the scrape configuration may be wrong (potentially in a way that does not cause scrape errors).
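To make the typo case concrete, here is a hypothetical misspelling of the expression above. It is syntactically valid PromQL, so no syntax checker will flag it, yet it can never fire:

    # Note the missing "s": http_error_total instead of http_errors_total.
    # This parses fine, matches zero series, and therefore never alerts.
    rate(http_error_total[1m]) > 0

A "time series examined" count of zero would surface exactly this kind of failure.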
As a Prometheus operator, I would like to be able to formulate an alert that notifies me if there are any (unexpected) rules that do not match anything in the first place. In my experience, only a small number of alert rules would need to be excluded from this, and the labelling outlined above would make that possible. This alert would neatly cover all the various reasons an alert or recording rule has become ineffective, without having to alert on each possible cause individually.
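As a sketch of what that meta-alert could look like, assuming the proposed per-rule metric existed (the name prometheus_rule_evaluation_series_examined is invented here; the issue proposes such a metric but does not name one):

    # Hypothetical metric; it does not exist in Prometheus today.
    # Fire when any rule outside an allow-list of intentionally-empty rule
    # groups examined zero time series in its last evaluation.
    prometheus_rule_evaluation_series_examined{rule_group!~"expected-empty.*"} == 0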
I don't have a direct use case for the information in ad-hoc queries, although I can see it being useful to understand why a query may be expensive.
Alternative
A similar effect could be achieved by pairing each rule with another rule that checks absent(…) on every single time series matcher. However, this is error prone (it won't catch typos, and how do you keep them in sync?), easy to forget, more computationally expensive (everything needs to be matched again), and verbose. If we instrument PromQL to collect this information anyway, it would be a great additional benefit to expose it in this way.
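For concreteness, the pairing described above would look roughly like this for the earlier example (absent() is an existing PromQL function; the sketch shows why keeping the pair in sync is manual work):

    # Original alert expression:
    rate(http_errors_total[1m]) > 0
    # Hand-written companion that fires when the metric is missing entirely.
    # It must repeat the metric name; a typo in either rule breaks the pair.
    absent(http_errors_total)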