
promtool test rules: off-by-one bug for functions applied to range vectors #4874

Closed
bititanb opened this Issue Nov 16, 2018 · 12 comments


bititanb commented Nov 16, 2018

Bug Report

What did you do?

promtool test rules test.yml

What did you expect to see?

    FAILED   # our alerts shouldn't trigger

What did you see instead?

    SUCCESS

Under which circumstances?

alerts.yml:

  - alert: SumOfInstanceStatusesOver1m
    expr: sum_over_time(up[1m]) == 2    # in prom expression browser this evaluates to 1

  - alert: SumOfInstanceStatusesOver2m
    expr: sum_over_time(up[2m]) == 3    # and this evaluates to 2 for '1 1 1' series
test.yml:

          - series: 'up{job="node_exporter", instance="localhost:9100"}'
            values: '1+0x6 0 0 0 0 0 0 0 0' # 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0

          - eval_time: 5m
            alertname: SumOfInstanceStatusesOver1m # this shouldn't trigger normally
            exp_alerts:
                - exp_labels:
                      instance: localhost:9100
                      job: node_exporter
          - eval_time: 5m
            alertname: SumOfInstanceStatusesOver2m # this shouldn't trigger either
            exp_alerts:
                - exp_labels:
                      instance: localhost:9100
                      job: node_exporter
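
(For reference, a sketch of what the complete test file looks like in the promtool unit-test format, assuming the two rules above are saved as alerts.yml next to it; the full configs are in the gist linked below:)

    rule_files:
      - alerts.yml

    evaluation_interval: 1m

    tests:
      - interval: 1m
        input_series:
          - series: 'up{job="node_exporter", instance="localhost:9100"}'
            values: '1+0x6 0 0 0 0 0 0 0 0'  # 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0
        alert_rule_test:
          - eval_time: 5m
            alertname: SumOfInstanceStatusesOver1m
            exp_alerts:
              - exp_labels:
                  instance: localhost:9100
                  job: node_exporter
          - eval_time: 5m
            alertname: SumOfInstanceStatusesOver2m
            exp_alerts:
              - exp_labels:
                  instance: localhost:9100
                  job: node_exporter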

Environment

  • System information:
Linux 4.15.0-36-generic x86_64
  • Prometheus version:
prometheus, version 2.5.0
(branch: promtool-issue-4838, revision: 30540d48301b9881c8ab66c3c6204f9aee49aa4b)
  build user:       root@prometheus
  build date:       20181116-12:12:49
  go version:       go1.10.4
  • Promtool version:
promtool, version 2.5.0
(branch: promtool-issue-4838, revision: 30540d48301b9881c8ab66c3c6204f9aee49aa4b)
  build user:       root@prometheus
  build date:       20181112-09:53:36
  go version:       go1.10.1
  • alerts and tests configuration files:

Modified example configs from docs:
https://gist.github.com/bititanb/1164058e966f1faf5b2aa4bdc3454249/revisions

More:

The same thing happens at least with the deriv() function: deriv(up[1m]) works in the tests, but in the Prometheus expression browser it effectively isn't usable at all, since it always returns "no datapoints found". It looks like deriv(up[1m]) in the tests actually behaves like deriv(up[2m]) does against a real server, sum_over_time(up[1m]) like sum_over_time(up[2m]), and so forth.

Thanks. /CC @codesome


brian-brazil commented Nov 16, 2018

I don't see a problem here. Why do you think this is wrong?


bititanb commented Nov 16, 2018

@brian-brazil The same expression sum_over_time(up[2m]) returns different results in the Prometheus web expression browser (2) and in promtool test rules (3) for the same set of data, so alerts that pass the tests do not work against an actual Prometheus instance and vice versa. Maybe I'm missing something?


brian-brazil commented Nov 16, 2018

Are you running exactly the same query with exactly the same time offsets relative to the data?


bititanb commented Nov 16, 2018

Yep, identical: same data, same query (and same range), no offsets or anything. I was investigating why alerts that passed the tests don't work in production. They do work in production now that I've tuned them around this strange off-by-one behaviour, but now my tests fail.


brian-brazil commented Nov 16, 2018

Time ranges are inclusive in PromQL, so it sounds like your alerts aren't resilient to jitter and/or failed scrapes.


bititanb commented Nov 16, 2018

Not sure I got the point.

For example, how can I test some really simple expression (just for illustration; it could be more complex and useful, but the problem would remain) like sum_over_time(metric_name[10m]) == 100?

It's true that it is not resilient to a NaN in a timeseries within the range (though I did not find a way to filter NaN out of range vectors) and always returns NaN in that case, but I'm not sure what you mean by jitter. There weren't any NaNs in the input for my expressions when I tested, though.

So should sum_over_time(metric_name[10m]) == 100 be expressed in some other way? Because at the moment it looks like [10m] means "get 10 samples" in Prometheus but "get 11 samples" in promtool unit tests.
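
To make the difference visible directly, counting the points (a sketch, assuming metric_name is scraped every 60 s):

    count_over_time(metric_name[10m])   # promtool test, eval_time 10m (aligned with a sample): 11
    count_over_time(metric_name[10m])   # live Prometheus, evaluation almost never aligned:     10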

Thanks.


brian-brazil commented Nov 16, 2018

Depending on races and what not, you might get anywhere between 0 and 11 points for the sum.


bititanb commented Nov 16, 2018

Wow, OK. So my tests basically test nothing.

From a UX perspective I don't see much value in such tests, though, because I'm not sure I can write alerts that will work reliably in both the testing and production environments with such inconsistent data ranges. It doesn't work even with the simplest expressions and inputs, on a Prometheus instance with almost zero load and 1m scrape/eval intervals.

You can close it I suppose?


brian-brazil commented Nov 16, 2018

The tests are for PromQL in an idealised environment; if your expressions are fragile, you may get unexpected results.


bititanb commented Nov 16, 2018

Checked again on a clean install of Prometheus and it still seems weird. No exporters besides Prometheus itself on :9090, so very low load. Eval/scrape intervals are 1 min for both Prometheus and the promtool unit tests.

From Prometheus, sum_over_time(up[1m]) always returns 1, very consistently; I can see it in the Graph view as a straight line for 30 minutes already. promtool test rules always (1000 runs) returns 2 for the same expression.

At the same time, the timestamps (and values) for up[10m] in Prometheus look like this, very consistently too:

1 @1542400149.131
1 @1542400209.131
1 @1542400269.131
1 @1542400329.131 
1 @1542400389.131 
1 @1542400449.131 
1 @1542400509.131 
1 @1542400569.131
1 @1542400629.131
1 @1542400689.131

If these timestamps are real, and if I understand inclusiveness correctly, up[1m] should actually always return 2 points instead of 1 (the scrape interval is 60s).

Is this really by design? If these off-by-ones are so consistent, maybe it would make sense from a UX perspective to make the unit tests match the out-of-the-box (basically idealised, but still real-world) behaviour of Prometheus? At the moment I have found no way to make Prometheus match the unit-testing results 1:1 even once when working with time ranges.

EDIT: I'm wrong about inclusiveness; it should indeed return 1 result, as it does. Still, the inconsistency between Prometheus and promtool persists.


gouthamve commented Dec 3, 2018

Hi, for the values you posted above:

1 @1542400149.131
1 @1542400209.131
...

If sum_over_time(up[1m]) aligns with any of the timestamps, i.e. is issued at exactly 1542400209.131, the query will return 2. Any other timestamp, even one millisecond off, gives 1. So you can see why Prometheus always returns 1: you need to be incredibly lucky to get 2.

Having said that, Prometheus can return 2, so your query is inherently brittle. I don't think this is a bug in the tests; ideally the tests should catch edge cases like the ones above.
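
Spelled out with the timestamps above (both ends of the range are inclusive here):

    query at t = 1542400209.131  ->  window [1542400149.131, 1542400209.131]
                                     contains 1 @1542400149.131 and 1 @1542400209.131  ->  sum_over_time(up[1m]) = 2
    query at t = 1542400209.132  ->  window [1542400149.132, 1542400209.132]
                                     contains only 1 @1542400209.131                   ->  sum_over_time(up[1m]) = 1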


bititanb commented Dec 3, 2018

Maybe you are right and nothing should be changed. In the end I went for ranges of at least [5m] and it works fine so far. Still, I didn't find any information in the docs about ranges being a little unreliable in this way...

you might get anywhere between 0 and 11 points for the sum

so if it's not there, it would be nice to have a few words (in best practices, maybe?) saying not to use things like deriv() over a [2m] range, because it might not work at all, and to go for ranges of at least 5-10m instead.
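
For example (just a sketch of the direction I took; the rule name and threshold here are made up), averaging over a wider window is much less sensitive to one sample more or less in the range:

      - alert: NodeExporterFlappingDown   # hypothetical rule, not from my real config
        expr: avg_over_time(up{job="node_exporter"}[5m]) < 0.9   # more than ~10% of the last 5m of scrapes were down
        for: 5m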

Anyways, thanks for the explanations!

bititanb closed this Dec 3, 2018
