Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

General purpose rule/alert testing tool #1695

Closed
tamsky opened this Issue May 31, 2016 · 28 comments

Comments

Projects
None yet
10 participants
@tamsky
Copy link

tamsky commented May 31, 2016

https://groups.google.com/d/msg/prometheus-developers/h6Vp2pf619A/AJsKJsJTHQAJ
DL;DR (very edited) summary:

Q on (8/7/15) by @swsnider : https://github.com/prometheus/prometheus/promql/test.go -- Is it appropriate to write some wrapper code to turn this into a general-purpose rule/alert testing tool?

A by @fabxc :
It is absolutely on our roadmap (should probably be added to the actual one on the website).
What's blocking it is not so much implementation but specification.

I would like to help get the specification ball rolling here.

@fabxc Has there been any work done to spec this out since that post?

@fabxc

This comment has been minimized.

Copy link
Member

fabxc commented Jun 2, 2016

Yes, getting the discussion started would help a lot and make it easier for people willing to contribute to start working on something. So far this area has not progressed.

For these kinds of things an RFC in a shared Google doc generally works best.

@tamsky

This comment has been minimized.

Copy link
Author

tamsky commented Jun 2, 2016

Let me get the ball rolling with a few ideas here, a google doc can follow.

(in the same groups thread) @fabxc wrote:
the syntax for evaluation is something that strongly bothers me. Specifying time (ranges) is fine for the current testing but not suitable for a more general purpose language.

Perhaps you can put into a few more words what bothers you regarding the current time (range) stuff, and alternatives that might help?

The current test syntax (to me) appears to be very simple, which is good:
load <step:duration>)
eval[_fail|_ordered] instant [at <offset:duration>] <query>

Concepts I think are missing:

  • ability to include ../*.rules
  • NOW <date:dateTime>|<[-+]offset:duration>
  • evaluation_interval (allows rule testing tool to interact with the global config param)
  • eval_vector (accepts an instant-vector, and drops all labels)
  • eval range
  • a way to express missed scrapes in load which should affect metric staleness

NOW would provide a way to set the date+time for all actions that follow, simplifying eval commands. Example ::

clear
evaluation_interval 5m
include "../rules/*.rules"

NOW 2016-06-02T00:00:00
load 5m
    http_errors{result="404"} 0+10x12 120x12   # steady error rate for 1h, zero additional for 1h

    # eval without an 'at' clause implies clock == NOW command above
    eval instant ALERT{alertname="http_errors", alertstate!~"pending", alertstate!~"firing"}
         # no data here == alert is not active

NOW 2016-06-02T00:30:00
    eval_vector range ALERT{alertname="http_errors", alertstate="pending"}[30m]   
    {}  1 1 1 1 0        # pending -> firing transition

NOW +30m
    eval_vector range ALERT{alertname="http_errors", alertstate="firing"}[10m]
    {}  1 0        # firing -> cleared

Especially as promtool should eventually replace prometheus_cli and the should use the same language for that.

Are there any query syntax changes proposed yet (in #1605 or elsewhere)?

@fabxc

This comment has been minimized.

Copy link
Member

fabxc commented Mar 12, 2017

Unfortunately we didn't follow up on the topic for a while even though we strongly want it.

One main concern I have in hindsight is the DSL we currently use for testing PromQL. While it's brief and precise, it's yet another language we have to parse, and more importantly, specify ourselves.
Especially as we want to extend it as discussed here, this seems to be asking for trouble. We should discuss switching to a bit more noisy but easier to handle YAML.

Especially as we want to have the following, partially overlapping, features:

  • easy to define PromQL tests
  • easy to define PromQL benchmarks
  • alerting rule unit testing for users (which can probably be reduced to test data + PromQL expression)

I also don't see the DSL incorporate loading of arbitrary test/benchmarking sample data well, which won't be specifiable in 20 characters.

@brian-brazil

This comment has been minimized.

Copy link
Member

brian-brazil commented Mar 12, 2017

alerting rule unit testing for users (which can probably be reduced to test data + PromQL expression)

We need rule file parsing and evaluation for that too, as there can be several layers of rules between data and alerts.

@gouthamve

This comment has been minimized.

Copy link
Member

gouthamve commented Apr 2, 2017

For alerts, this should also have the ability to test that the label and annotation templates expand to the expected values.

@juliusv

This comment has been minimized.

Copy link
Member

juliusv commented Apr 2, 2017

@gouthamve That part is already mentioned/covered in #1220, although it could be additionally relevant here too.

@juliusv

This comment has been minimized.

Copy link
Member

juliusv commented Apr 2, 2017

And also #1219.

@fabxc

This comment has been minimized.

Copy link
Member

fabxc commented May 16, 2017

If we want to actually test not just whether the expression triggers but also whether it generates the right annotations and labels, we need to consider the full alert.
Copying an alert into a unit test file works but will with high certainty drift over time. So practically, it would be good to just reference it by <rulfile>/<alertname>. This won't work because alert names don't have to be unique, even on a per-file basis. The only other good option then seems to be inlining of unit tests directly into the rule files. That might be a bit noisy though.

Would be open to enforcing alertname uniqueness (I don't think many people use duplicates anyway) or does anyone have other ideas to address this?

@juliusv

This comment has been minimized.

Copy link
Member

juliusv commented May 16, 2017

I think using the same alertname with different severity labels (expression with different thresholds) would be a common use case.

@brian-brazil

This comment has been minimized.

Copy link
Member

brian-brazil commented May 16, 2017

Yes, duplicate alertnames with different labels is common. There could also be duplicates across rules from different teams.

@fabxc

This comment has been minimized.

Copy link
Member

fabxc commented May 16, 2017

Forgot about warn/page level alerts with the same name – haven't written alerts in a while and it shows :)

Anyway, so how would we tackle this then? Inline unit tests directly in rules files or rely on users to keep their alerts and their copies in unit tests in sync?

@brian-brazil

This comment has been minimized.

Copy link
Member

brian-brazil commented May 16, 2017

I would have it that unit tests can pull in the rule files/groups that are relevant.

@fabxc

This comment has been minimized.

Copy link
Member

fabxc commented May 16, 2017

Yea sure, but how is this supposed to happen if there's no unique identifier to reference a rule by?

@brian-brazil

This comment has been minimized.

Copy link
Member

brian-brazil commented May 16, 2017

This is why I've been saying that rule groups should usually be named.

@juliusv

This comment has been minimized.

Copy link
Member

juliusv commented May 16, 2017

Or just specify a (list of) rule files to load in whole, a list of series and their samples as input (similarly to how we load in data for testing expressions), and the entirety of expected received alerts / notifications as output. That still allows you to test only one alert per test if you want (simply reference the same rule files from multiple test cases, but with different expected inputs / outputs). It also handles the case nicely where an alerting rule depends on other recording rules, etc.

@fabxc

This comment has been minimized.

Copy link
Member

fabxc commented May 16, 2017

Mh, that might actually work – makes it easy to define cases triggering warn and severe at once too. Would still constrain that to <file>:<group_name> as references though.

@gouthamve

@aweiteka

This comment has been minimized.

Copy link

aweiteka commented Dec 5, 2017

I have a high interest in rules testing. I've tried m-lab/prometheus-support#30 and like the general pattern.

I'll be at KubeCon 2017 in Austin this week if folks want to discuss.

@gouthamve

This comment has been minimized.

Copy link
Member

gouthamve commented Dec 5, 2017

@serathius

This comment has been minimized.

Copy link
Contributor

serathius commented Mar 16, 2018

Is anyone working on this issue? I would be interested in pushing it further with some help.

@brian-brazil

This comment has been minimized.

Copy link
Member

brian-brazil commented Mar 16, 2018

Noone is working on this that I'm aware of.

@gouthamve

This comment has been minimized.

Copy link
Member

gouthamve commented Mar 16, 2018

/cc @codesome who wants to put this in his GSoC proposal

@codesome

This comment has been minimized.

Copy link
Member

codesome commented Mar 16, 2018

Yes, I plan to add this in my GSoC proposal

@codesome

This comment has been minimized.

@kevinjqiu

This comment has been minimized.

Copy link

kevinjqiu commented Jun 18, 2018

Ahh didn't know it's a feature being actively worked on. At my work, we came up with this https://github.com/kevinjqiu/pat and hopefully this can be of some use to someone.

@DeathBorn

This comment has been minimized.

Copy link

DeathBorn commented Jul 4, 2018

for some reson this pat tool does not work with record rules included in alerts.

@kevinjqiu

This comment has been minimized.

Copy link

kevinjqiu commented Jul 4, 2018

@codesome

This comment has been minimized.

Copy link
Member

codesome commented Sep 25, 2018

Got added in #4350.
@brian-brazil I think we can close this.

@lock

This comment has been minimized.

Copy link

lock bot commented Mar 24, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@lock lock bot locked and limited conversation to collaborators Mar 24, 2019

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
You can’t perform that action at this time.