Promtool randomly failing when record not in same file as rule #5241
Comments
With dependent rule groups, you want to use group_eval_order
As @brian-brazil mentioned, can you try fixing the order of the group evaluation and see whether it still fails randomly? You can read about it in the docs here: https://prometheus.io/docs/prometheus/latest/configuration/unit_testing_rules/
zonArt commented Feb 20, 2019
@brian-brazil Much better, thanks for the tip
Indeed, adding this fixed the bug. Thanks for pointing me to the right doc page:

rule_files:
  # These files are searched relative to where you execute the test (not relative to the current file).
  - alerts.yml
  - records.yml
group_eval_order:
  - records.yml
  - alerts.yml

It now works deterministically.
Thanks a lot for the fast response @brian-brazil @codesome |
waberc closed this Feb 20, 2019
waberc commented Feb 20, 2019
Bug Report
What did you do?
I am working with three files: one for the recording rules, one for the alert rules, and a third one for the test.
Records
Alerts
Test file
What did you expect to see?
The test should either always fail (because the calculated value is not right, and I could then adapt my test) or always succeed.
What did you see instead? Under which circumstances?
If I run this test several times, I don't always get the same result (it looks like there are two different possible results/ways to calculate the records):
To investigate a bit further, since the alert was not triggering here, I lowered the alert threshold condition (<2 instead of <20) so that the alert would always trigger, and I saw that in some cases the value is not 26.67 but 25.68. So it looks like promtool has two different ways to calculate things. As I understand it, one way triggered the alert because it was >5m and >20%, while the other way of computing the metrics led to >20% but not yet firing because <5m.
Test with lowered threshold in alert rule "<2"
Finally, we observed that changing the order in which the files (records, alerts) are listed in the test file just inverts the success/failure rate: there are still those two ways to calculate things, but the probabilities are swapped. So it seems the order in which the files are listed matters, but they are not always processed in the listed order... (I personally don't see how it could evaluate the alerts before knowing the records...)
Test with reversed file import rule_files: in test file
Finally, this random behaviour disappears when we put all the rules, records and alerts, in the same file.
If we put the records first, at the top of the file, followed by the alerts, we always get SUCCESS.
If we put the alerts at the beginning of the file, followed by the records, we always get FAILED, with this exact same value of 25.68 instead of 26.67, or not firing because still pending.
So we reproduce the same two ways of calculating things, which is still strange, but now deterministically.
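One plausible reading of these two outcomes is that an alert group evaluated before the record group only sees the record written in the previous cycle, so it runs one interval behind. The toy model below is only a sketch under that assumption and is not promtool's actual implementation; `simulate`, `threshold`, `for_cycles`, and the sample values are all hypothetical.

```python
def simulate(raw_values, group_order, threshold=20.0, for_cycles=2):
    """Toy evaluation loop: one 'records' group and one 'alerts' group per cycle.

    The record simply copies the raw value; the alert holds while the
    recorded value is below `threshold`, and fires once it has held for
    `for_cycles` consecutive cycles (a stand-in for the rule's `for:` clause).
    """
    recorded = None      # latest sample written by the "records" group
    active_since = None  # cycle at which the alert condition first held
    states = []
    for cycle, raw in enumerate(raw_values):
        for group in group_order:
            if group == "records":
                recorded = raw
            else:  # "alerts": reads whatever record is currently available
                if recorded is not None and recorded < threshold:
                    if active_since is None:
                        active_since = cycle
                    held = cycle - active_since
                    state = "firing" if held >= for_cycles else "pending"
                else:
                    active_since = None
                    state = "inactive"
                states.append((state, recorded))
    return states


raw = [30.0, 15.0, 14.0, 13.0, 12.0]  # hypothetical input series
records_first = simulate(raw, ["records", "alerts"])
alerts_first = simulate(raw, ["alerts", "records"])

# Compare what the alert sees at the same evaluation cycle (index 3).
print(records_first[3])  # ('firing', 13.0)
print(alerts_first[3])   # ('pending', 14.0)
```

In this sketch the same evaluation time yields either a firing alert with the current value or a pending alert with the previous value, depending only on group order, which mirrors the SUCCESS/FAILED split described above.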
Environment
Running on Mac (but the same thing is reproduced on CentOS):
Darwin 17.7.0 x86_64