
Scrape-time rule evaluation #394

Open
brian-brazil opened this Issue Jun 9, 2014 · 28 comments

@brian-brazil (Member) commented Jun 9, 2014

I'm currently working with the /proc/meminfo stats, and one of the missing stats is "Memory Used", so you have to calculate it. Calculating it via recording rules isn't a good idea due to time skew, so it'd be handy if you could specify some rules to be run at scrape time against the data for a given instance.
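
For context, the recording-rule workaround would look something like the sketch below; the metric names are assumed from node_exporter's /proc/meminfo collector and may differ in your setup:

```yaml
groups:
- name: meminfo
  rules:
  - record: node_memory_MemUsed_bytes
    # Each operand resolves to its latest sample at rule-evaluation time, so
    # the operands can come from different scrapes: the time skew in question.
    expr: >
      node_memory_MemTotal_bytes - node_memory_MemFree_bytes
      - node_memory_Buffers_bytes - node_memory_Cached_bytes
```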

@camerondavison commented Apr 28, 2015

Are you talking about something like the recording rules at http://prometheus.io/docs/querying/rules/?

@brian-brazil (Member, Author) commented Apr 28, 2015

Yes, but evaluated in the context of a target at scrape time rather than at recording rule evaluation time.

@nomis52 commented Jul 1, 2016

Yes please! I'm new to Prometheus, having used a very similar system for the last decade. The ability to do standing rule evaluation was a core piece of the real-time monitoring story.

@brian-brazil (Member, Author) commented Jul 14, 2017

While I think we'll need this at some point, there's no clamour for it yet, so: maybe.

@roidelapluie (Contributor) commented Mar 4, 2018

When implementing this, it would be great to have those recording rules also be able to look at the past, e.g. to calculate irates/rates.

@roidelapluie (Contributor) commented Mar 4, 2018

Extra note: do we want this before or after relabelling? I'd say after relabelling, so that we can compare with relabelled metrics from previous scrapes.

@brian-brazil (Member, Author) commented Mar 4, 2018

It has to be after metric relabelling; it doesn't make sense otherwise.
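
For example, if metric relabelling renames a series, a scrape-time rule has to see the renamed series to be useful. The snippet below is a generic illustration, not something from this thread:

```yaml
scrape_configs:
- job_name: node
  metric_relabel_configs:
  # Rename node_memory_MemTotal to node_memory_MemTotal_bytes at scrape time.
  - source_labels: [__name__]
    regex: node_memory_MemTotal
    target_label: __name__
    replacement: node_memory_MemTotal_bytes
# A scrape-time rule referencing node_memory_MemTotal_bytes only works if it
# runs after this relabelling step.
```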

@roidelapluie (Contributor) commented Aug 13, 2018

Can we move this one step further and agree on a design?

```yaml
groups:
- name: example
  job: snmp
  rules:
  - record: ifHCInOctets:irate6m
    expr: irate(ifHCInOctets{ {{range $k, $v := $scrapeLabels}}{{$k}}=\"{{$v}}\",{{end}} }[6m])
```
• The timestamp would be the timestamp of the recording rule, not the scrape.
• This would be done after the scrape is written to the TSDB, so it would be independent of the scrape, and therefore not guaranteed to run after each scrape.
• This would not run on failed scrapes.
• We would not guarantee that rules do not run in parallel, but we would guarantee that for the same target the same recording rule does not run multiple times concurrently, so best practice is to only write rules that use the unique labels of the scrape.

The problem in my proposal is that "expr" becomes go-templated.
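
To make the templating concrete: for a hypothetical target with labels job="snmp" and instance="switch1:161", the expr above would render to something like this (note the trailing comma the range loop leaves behind, which PromQL tolerates):

```yaml
expr: irate(ifHCInOctets{job="snmp",instance="switch1:161",}[6m])
```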

@brian-brazil (Member, Author) commented Aug 13, 2018

> The timestamp would be the timestamp of the recording rule, not the scrape.

It would be of the scrape. Why would you want a less accurate timestamp?

> so it would be independent of the scrape, and therefore not guaranteed to run after each scrape

We should be trying to run after each scrape. The question is more how this interacts with the scrape timeout.

> This would not run on failed scrapes.

It would, just like recording rules do.

@brian-brazil (Member, Author) commented Aug 13, 2018

> The problem in my proposal is that "expr" becomes go-templated.

That would be very difficult for a user to use. Instead, the PromQL would be tweaked in the background. This prevents anyone using honor_labels/metric_relabel_configs to take in samples that don't match the target labels, but I think that's a fundamental issue here: such a rule can only access data from its target.

@roidelapluie (Contributor) commented Aug 13, 2018

honor_labels/metric_relabel_configs must be out of scope, because with them you can get metrics that have no common labels.

Or are we ready to limit the scope of the rules to exactly the series of the job/target (and then forbid the usage of, e.g., stddev)?

@brian-brazil (Member, Author) commented Aug 13, 2018

> Or are we ready to limit the scope of the rules to exactly the series of the job/target (and then forbid the usage of, e.g., stddev)?

Scrape-time rules are per-target, so if you want to do stddev across targets you still need a recording rule. Stddev within a target would still work.
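
As a sketch under the draft syntax from the earlier comment (the job field and the metric are purely illustrative), a within-target aggregation would still be expressible:

```yaml
groups:
- name: example
  job: snmp
  rules:
  - record: instance:ifHCInOctets:stddev
    # Aggregates across the per-interface series of one target only, since
    # the selector would be implicitly scoped to that target's labels.
    expr: stddev(ifHCInOctets)
```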

@roidelapluie (Contributor) commented Aug 13, 2018

That makes writing rules simpler. But how is the data about which time series belongs to which target recorded, and persisted across restarts? I guess we already have that for staleness.

@brian-brazil (Member, Author) commented Aug 13, 2018

It isn't; the PromQL would be adjusted to have matchers for all the target labels added to it.

Something you haven't considered: how do you tell a scrape-time rule which targets it applies to?
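
In other words, the rewrite would be invisible to the rule author. Illustratively (the exact rewrite mechanics were not specified here), for a hypothetical target with labels job="snmp" and instance="switch1:161":

```
# As written by the user:
expr: irate(ifHCInOctets[6m])
# As evaluated for that target, with its labels injected as matchers:
expr: irate(ifHCInOctets{job="snmp",instance="switch1:161"}[6m])
```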

@roidelapluie (Contributor) commented Aug 13, 2018

We set the "job: snmp" at the group level, which implies the group will run after each scrape of a target in that job.

@roidelapluie (Contributor) commented Aug 13, 2018

> It isn't; the PromQL would be adjusted to have matchers for all the target labels added to it.

That is then ignoring relabelling/honor_labels?

@brian-brazil (Member, Author) commented Aug 13, 2018

> We set the "job: snmp" at the group level, which implies the group will run after each scrape of a target in that job.

Why is the job label special here, or do you mean the scrape config name?

> That is then ignoring relabelling/honor_labels?

It's more that the labels of the scraped data may not match the target labels.

@roidelapluie (Contributor) commented Aug 13, 2018

> We set the "job: snmp" at the group level, which implies the group will run after each scrape of a target in that job.

Yes, the scrape config name, so job_name would be better.

> It's more that the labels of the scraped data may not match the target labels.

So what would be the solution here?

@brian-brazil (Member, Author) commented Aug 13, 2018

> Yes, the scrape config name, so job_name would be better.

That could be an entire Kubernetes cluster, of which you only want to apply the rule to one service. I think this should work off target labels.

> So what would be the solution here?

I don't think there is one; it'd just be a known limitation.

@roidelapluie (Contributor) commented Aug 13, 2018

```yaml
groups:
- name: example
  target_labels:
  - job: [snmp]
  - zone: [dmz,internet]
  rules:
  - record: ifHCInOctets:irate6m
    expr: irate(ifHCInOctets[6m])
```

@roidelapluie (Contributor) commented Aug 13, 2018

Without also limiting on job_name, there would be a penalty on every scrape.

@brian-brazil (Member, Author) commented Aug 13, 2018

Not really, targets aren't created often.

Templating should also not be in the expr.

@roidelapluie (Contributor) commented Aug 13, 2018

Okay, so we would scan the targets when they are created/changed/deleted, and also when rules are created/changed/deleted.

@roidelapluie (Contributor) commented Aug 14, 2018

Proposed design, after this discussion:

```yaml
groups:
- name: example
  target_labels:
  - job: [snmp]
  - zone: [dmz,internet]
  target_labels_re:
  - instance: ["nexus.*"]
  rules:
  - record: ifHCInOctets:irate6m
    expr: irate(ifHCInOctets[6m])
```

• A group's interval and [target_labels, target_labels_re] are mutually exclusive.
• When a target is scraped, rule groups whose target_labels and target_labels_re match the target are selected, and the rules of those groups are executed sequentially within each group.
• (target_labels or target_labels_re) must have at least one non-empty label.
• Rules are not evaluated upon unsuccessful scrapes. In that case, stale markers are inserted (note: what about alerts??).
• The PromQL needs to contain at least one metric (vector(0) would not be valid). The metrics in the PromQL get all the labels of the target attached (foo becomes foo{job='xx',instance='dd',extra='x'}).
• It is guaranteed that a group runs at most once at a time for each target; if the previous run is still in progress for that target, the new one is not executed, and there are counters to count the missed evaluations.

@brian-brazil (Member, Author) commented Aug 15, 2018

> target_labels and target_labels_re

This is inventing a new way to do label matching, which is already a bit icky for the Alertmanager, which does something similar. How about using selectors?

> Rules are not evaluated upon unsuccessful scrapes. In that case, stale markers are inserted

Rules must be evaluated after every scrape; rules can produce results even when a scrape fails. This is the same as recording rules.

> The PromQL needs to contain at least one metric.

I don't see why this is required.

@roidelapluie (Contributor) commented Aug 15, 2018

So, something like:

```yaml
target_labels: '{foo="bar",job=~"node|snmp"}'
```

The reasoning for requiring at least one metric was to be sure that the scope of the recording rule can be limited. But indeed you're right, it should not be mandatory.
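
For concreteness, a full group under that selector style might look like the sketch below; the target_labels field name and its semantics are just the assumption from this thread, nothing agreed:

```yaml
groups:
- name: example
  target_labels: '{zone=~"dmz|internet",job=~"node|snmp"}'
  rules:
  - record: ifHCInOctets:irate6m
    expr: irate(ifHCInOctets[6m])
```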

One other question: Do we want to add the target labels to the resulting metrics?

@brian-brazil (Member, Author) commented Aug 15, 2018

Yes, otherwise we couldn't ensure that the labels were present.
