Promtool randomly failing when record not in same file as rule #5241

Closed
waberc opened this Issue Feb 20, 2019 · 5 comments

waberc commented Feb 20, 2019

Bug Report

What did you do?
I am working with one file for the recording rules, another file for the alert rules, and a third file for the test.

Records

groups:
  - name: records.yml
    rules:
      - expr: |
          sum by (kubernetes_pod_name, app_name)
          (increase(http_server_requests_seconds_count{job="kubernetes-service-endpoints", status=~"4.."}[10m]))
        record: http_server_requests_seconds_count:http4xx:increase10m

      - expr: |
          sum by (kubernetes_pod_name, app_name)
          (increase(http_server_requests_seconds_count{job="kubernetes-service-endpoints"}[10m]))
        record: http_server_requests_seconds_count:increase10m

      - expr: 1 - ((vector(1) and hour() == 5 and minute() <= 30 and day_of_week() == 1) or vector(0))
        record: not_in_silence_schedule
        # Use this to silence alerts during the schedule "every Monday between 05:00 and 05:30": multiply the alert expression by `scalar(not_in_silence_schedule)`.
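
For reference, a rough sketch of how this expression behaves (assuming standard PromQL `and`/`or` semantics; not part of the rule file itself):

# Inside the window (Monday, 05:00-05:30):
#   (vector(1) and hour() == 5 and minute() <= 30 and day_of_week() == 1)  -> 1
#   1 - (1 or vector(0))                                                   -> 0
#   scalar(not_in_silence_schedule) = 0, so the alert expression is multiplied by 0
# Outside the window:
#   the `and` chain returns an empty vector, so (... or vector(0))         -> 0
#   1 - 0                                                                  -> 1
#   scalar(not_in_silence_schedule) = 1, so the alert expression is unchanged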

Alerts

groups:
  - name: alerts.yml
    rules:
      - alert: httpErrorCode4xxOnServiceOver20percent
        expr: |
          100 * sum without (kubernetes_pod_name) (http_server_requests_seconds_count:http4xx:increase10m)
            / sum without (kubernetes_pod_name) (http_server_requests_seconds_count:increase10m)
            * scalar(not_in_silence_schedule)
            > 20
        for: 5m
        labels:
          priority: P3
          team: solutions
        annotations:
          summary: '{{ $labels.app_name }}: more than 20% HTTP 4xx codes: {{ printf "%.2f" $value }}%'
          description: 'Application  {{ $labels.app_name }} has too many HTTP errors in 4xx: ({{ printf "%.2f" $value }}%).'

Test file

rule_files:
  # These files are resolved relative to where you execute the test (not relative to the current file).
  - records.yml
  - alerts.yml

evaluation_interval: 1m

tests:
# Test for HTTP status codes
  - interval: 1m
    input_series:
      - series: 'http_server_requests_seconds_count{status="200", uri="/something/{}/", app_name="query-api", job="kubernetes-service-endpoints", kubernetes_pod_name="query-api-59858b6549-cq2s5"}'
        values: '0+10x23'
      - series: 'http_server_requests_seconds_count{status="200", uri="/other/{}/", app_name="query-api", job="kubernetes-service-endpoints", kubernetes_pod_name="query-api-59858b6549-cq2s5"}'
        values: '0+1x23'
      - series: 'http_server_requests_seconds_count{status="403", uri="/else/does/not/work/anymore", app_name="query-api", job="kubernetes-service-endpoints", kubernetes_pod_name="query-api-59858b6549-cq2s5"}'
        values: '0+0x2 1+1x6 10+3x5 25+0x9'
      - series: 'http_server_requests_seconds_count{status="404", uri="/else/weird", app_name="query-api", job="kubernetes-service-endpoints", kubernetes_pod_name="query-api-59858b6549-cq2s5"}'
        values: '0+2x23'
      - series: 'http_server_requests_seconds_count{status="500", uri="/some/exception", app_name="query-api", job="kubernetes-service-endpoints", kubernetes_pod_name="query-api-59858b6549-cq2s5"}'
        values: '0+0x15 20+20x8'

        #t 0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20  21
        # HTTP 200
        #  0  10  20  30  40  50  60  70  80  90 100 110 120 130 140 150 160 170 180 190 200 210
        #  0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20  21
        # HTTP 4xx
        #  0   0   0   1   2   3   4   5   6   7  10  13  16  19  22  25  25  25  25  25  25  25
        #  0   2   4   6   8  10  12  14  16  18  20  22  24  26  28  30  32  34  36  38  40  42
        # HTTP 5xx
        #  0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0  20  40  60  80 100 120

    alert_rule_test:
      - eval_time: 14m
        alertname: httpErrorCode4xxOnServiceOver20percent
        exp_alerts:
          - exp_labels:
              app_name: query-api
              priority: P3
              team: solutions
            exp_annotations:
              description: "Application  query-api has too many HTTP errors in 4xx: (26.67%)."
              summary: "query-api: more than 20% HTTP 4xx codes: 26.67%"
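
As an aside, a promql_expr_test block next to the alert_rule_test above could pin down which value the recording rules actually produce at 14m. This is a sketch only; the expected value is my back-of-envelope reading of the input series, not verified with promtool:

    promql_expr_test:
      - expr: http_server_requests_seconds_count:http4xx:increase10m
        eval_time: 14m
        exp_samples:
          - labels: 'http_server_requests_seconds_count:http4xx:increase10m{app_name="query-api", kubernetes_pod_name="query-api-59858b6549-cq2s5"}'
            value: 40  # rough expectation from the series above, not verified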

What did you expect to see?
The test should either always fail (the calculated value would be wrong and I could adapt my test) or always succeed.

What did you see instead? Under which circumstances?
If I run the test several times, I don't always get the same result (it looks like there are two different ways the records can end up being calculated):

 $  for i in {1..10}; do echo "Test $i $(python -c 'import datetime; print(datetime.datetime.now())') : $(./promtool test rules test-rules.yml | egrep 'FAILED|SUCCESS')"; done
Test 1 2019-02-20 16:44:53.821979 :   SUCCESS
Test 2 2019-02-20 16:44:53.942164 :   SUCCESS
Test 3 2019-02-20 16:44:54.036496 :   SUCCESS
Test 4 2019-02-20 16:44:54.126070 :   SUCCESS
Test 5 2019-02-20 16:44:54.216438 :   SUCCESS
Test 6 2019-02-20 16:44:54.308160 :   SUCCESS
  FAILED:
    alertname:httpErrorCode4xxOnServiceOver20percent, time:14m0s, 
        exp:"[Labels:{alertname=\"httpErrorCode4xxOnServiceOver20percent\", app_name=\"query-api\", priority=\"P3\", team=\"solutions\"} Annotations:{description=\"Application  query-api has too many HTTP errors in 4xx: (26.67%).\", summary=\"query-api: more than 20% HTTP 4xx codes: 26.67%\"}]", 
        got:"[]"
Test 7 2019-02-20 16:44:54.395952 : 
Test 8 2019-02-20 16:44:54.481421 :   SUCCESS
Test 9 2019-02-20 16:44:54.571751 :   SUCCESS
Test 10 2019-02-20 16:44:54.661079 :   SUCCESS

To investigate a bit further, since the alert was not firing here, I lowered the alert threshold (> 2 instead of > 20) so that the alert would always trigger, and I saw that in some cases the value is not 26.67 but 25.68. So it looks like promtool has two different ways of calculating things. My understanding is that one way triggers the alert because it has been above 20% for more than 5m, while the other way of computing the metrics also ends up above 20% but the alert is not yet firing because it has been pending for less than 5m.
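
Doing the 10m increases by hand from the input series (taking both window boundaries as included and ignoring extrapolation, so just a rough check), the two values seem to correspond to two different evaluation times for the records:

window ending at 14m: 4xx = (22-2) + (28-8) = 40, total = 40 + 100 + 10 + 0 = 150  ->  100 * 40 / 150 = 26.67%
window ending at 13m: 4xx = (19-1) + (26-6) = 38, total = 38 + 100 + 10 + 0 = 148  ->  100 * 38 / 148 = 25.68%

So 25.68% looks like what the alert sees when, at 14m, the records for 14m have not been evaluated yet and the query falls back to the previous (13m) record samples.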

Test with lowered threshold in the alert rule ("> 2")

$  for i in {1..10}; do echo "Test $i $(python -c 'import datetime; print(datetime.datetime.now())') : $(./promtool test rules test-rules.yml | egrep 'FAILED|SUCCESS')"; done
  FAILED:
    alertname:httpErrorCode4xxOnServiceOver20percent, time:14m0s, 
        exp:"[Labels:{alertname=\"httpErrorCode4xxOnServiceOver20percent\", app_name=\"query-api\", priority=\"P3\", team=\"solutions\"} Annotations:{description=\"Application  query-api has too many HTTP errors in 4xx: (26.67%).\", summary=\"query-api: more than 20% HTTP 4xx codes: 26.67%\"}]", 
        got:"[Labels:{alertname=\"httpErrorCode4xxOnServiceOver20percent\", app_name=\"query-api\", priority=\"P3\", team=\"solutions\"} Annotations:{description=\"Application  query-api has too many HTTP errors in 4xx: (25.68%).\", summary=\"query-api: more than 20% HTTP 4xx codes: 25.68%\"}]"
Test 1 2019-02-20 16:45:12.691844 : 
Test 2 2019-02-20 16:45:12.787618 :   SUCCESS
  FAILED:
    alertname:httpErrorCode4xxOnServiceOver20percent, time:14m0s, 
        exp:"[Labels:{alertname=\"httpErrorCode4xxOnServiceOver20percent\", app_name=\"query-api\", priority=\"P3\", team=\"solutions\"} Annotations:{description=\"Application  query-api has too many HTTP errors in 4xx: (26.67%).\", summary=\"query-api: more than 20% HTTP 4xx codes: 26.67%\"}]", 
        got:"[Labels:{alertname=\"httpErrorCode4xxOnServiceOver20percent\", app_name=\"query-api\", priority=\"P3\", team=\"solutions\"} Annotations:{description=\"Application  query-api has too many HTTP errors in 4xx: (25.68%).\", summary=\"query-api: more than 20% HTTP 4xx codes: 25.68%\"}]"
Test 3 2019-02-20 16:45:12.877354 : 
  FAILED:
    alertname:httpErrorCode4xxOnServiceOver20percent, time:14m0s, 
        exp:"[Labels:{alertname=\"httpErrorCode4xxOnServiceOver20percent\", app_name=\"query-api\", priority=\"P3\", team=\"solutions\"} Annotations:{description=\"Application  query-api has too many HTTP errors in 4xx: (26.67%).\", summary=\"query-api: more than 20% HTTP 4xx codes: 26.67%\"}]", 
        got:"[Labels:{alertname=\"httpErrorCode4xxOnServiceOver20percent\", app_name=\"query-api\", priority=\"P3\", team=\"solutions\"} Annotations:{description=\"Application  query-api has too many HTTP errors in 4xx: (25.68%).\", summary=\"query-api: more than 20% HTTP 4xx codes: 25.68%\"}]"
Test 4 2019-02-20 16:45:12.966859 : 
Test 5 2019-02-20 16:45:13.061536 :   SUCCESS
Test 6 2019-02-20 16:45:13.151596 :   SUCCESS
Test 7 2019-02-20 16:45:13.242083 :   SUCCESS
Test 8 2019-02-20 16:45:13.328800 :   SUCCESS
Test 9 2019-02-20 16:45:13.417598 :   SUCCESS
Test 10 2019-02-20 16:45:13.520905 :   SUCCESS

Finally, we observed that reversing the order in which the files (records, alerts) are listed in the test file simply inverts the success/failure rate: the tool still computes things in those two different ways, just with the opposite probability. So it seems the order in which the files are listed matters, but evaluation does not always happen in that order... (I personally don't understand how it could evaluate the alerts before knowing the records.)

Test with the rule_files: entries listed in reverse order in the test file

 $  for i in {1..10}; do echo "Test $i $(python -c 'import datetime; print(datetime.datetime.now())') : $(./promtool test rules test-rules.yml | egrep 'FAILED|SUCCESS')"; done
  FAILED:
    alertname:httpErrorCode4xxOnServiceOver20percent, time:14m0s, 
        exp:"[Labels:{alertname=\"httpErrorCode4xxOnServiceOver20percent\", app_name=\"query-api\", priority=\"P3\", team=\"solutions\"} Annotations:{description=\"Application  query-api has too many HTTP errors in 4xx: (26.67%).\", summary=\"query-api: more than 20% HTTP 4xx codes: 26.67%\"}]", 
        got:"[Labels:{alertname=\"httpErrorCode4xxOnServiceOver20percent\", app_name=\"query-api\", priority=\"P3\", team=\"solutions\"} Annotations:{description=\"Application  query-api has too many HTTP errors in 4xx: (25.68%).\", summary=\"query-api: more than 20% HTTP 4xx codes: 25.68%\"}]"
Test 1 2019-02-20 16:59:08.112301 : 
  FAILED:
    alertname:httpErrorCode4xxOnServiceOver20percent, time:14m0s, 
        exp:"[Labels:{alertname=\"httpErrorCode4xxOnServiceOver20percent\", app_name=\"query-api\", priority=\"P3\", team=\"solutions\"} Annotations:{description=\"Application  query-api has too many HTTP errors in 4xx: (26.67%).\", summary=\"query-api: more than 20% HTTP 4xx codes: 26.67%\"}]", 
        got:"[Labels:{alertname=\"httpErrorCode4xxOnServiceOver20percent\", app_name=\"query-api\", priority=\"P3\", team=\"solutions\"} Annotations:{description=\"Application  query-api has too many HTTP errors in 4xx: (25.68%).\", summary=\"query-api: more than 20% HTTP 4xx codes: 25.68%\"}]"
Test 2 2019-02-20 16:59:08.367946 : 
  FAILED:
    alertname:httpErrorCode4xxOnServiceOver20percent, time:14m0s, 
        exp:"[Labels:{alertname=\"httpErrorCode4xxOnServiceOver20percent\", app_name=\"query-api\", priority=\"P3\", team=\"solutions\"} Annotations:{description=\"Application  query-api has too many HTTP errors in 4xx: (26.67%).\", summary=\"query-api: more than 20% HTTP 4xx codes: 26.67%\"}]", 
        got:"[Labels:{alertname=\"httpErrorCode4xxOnServiceOver20percent\", app_name=\"query-api\", priority=\"P3\", team=\"solutions\"} Annotations:{description=\"Application  query-api has too many HTTP errors in 4xx: (25.68%).\", summary=\"query-api: more than 20% HTTP 4xx codes: 25.68%\"}]"
Test 3 2019-02-20 16:59:08.458847 : 
Test 4 2019-02-20 16:59:08.549136 :   SUCCESS
  FAILED:
    alertname:httpErrorCode4xxOnServiceOver20percent, time:14m0s, 
        exp:"[Labels:{alertname=\"httpErrorCode4xxOnServiceOver20percent\", app_name=\"query-api\", priority=\"P3\", team=\"solutions\"} Annotations:{description=\"Application  query-api has too many HTTP errors in 4xx: (26.67%).\", summary=\"query-api: more than 20% HTTP 4xx codes: 26.67%\"}]", 
        got:"[Labels:{alertname=\"httpErrorCode4xxOnServiceOver20percent\", app_name=\"query-api\", priority=\"P3\", team=\"solutions\"} Annotations:{description=\"Application  query-api has too many HTTP errors in 4xx: (25.68%).\", summary=\"query-api: more than 20% HTTP 4xx codes: 25.68%\"}]"
Test 5 2019-02-20 16:59:08.638319 : 
  FAILED:
    alertname:httpErrorCode4xxOnServiceOver20percent, time:14m0s, 
        exp:"[Labels:{alertname=\"httpErrorCode4xxOnServiceOver20percent\", app_name=\"query-api\", priority=\"P3\", team=\"solutions\"} Annotations:{description=\"Application  query-api has too many HTTP errors in 4xx: (26.67%).\", summary=\"query-api: more than 20% HTTP 4xx codes: 26.67%\"}]", 
        got:"[Labels:{alertname=\"httpErrorCode4xxOnServiceOver20percent\", app_name=\"query-api\", priority=\"P3\", team=\"solutions\"} Annotations:{description=\"Application  query-api has too many HTTP errors in 4xx: (25.68%).\", summary=\"query-api: more than 20% HTTP 4xx codes: 25.68%\"}]"
Test 6 2019-02-20 16:59:08.728282 : 
  FAILED:
    alertname:httpErrorCode4xxOnServiceOver20percent, time:14m0s, 
        exp:"[Labels:{alertname=\"httpErrorCode4xxOnServiceOver20percent\", app_name=\"query-api\", priority=\"P3\", team=\"solutions\"} Annotations:{description=\"Application  query-api has too many HTTP errors in 4xx: (26.67%).\", summary=\"query-api: more than 20% HTTP 4xx codes: 26.67%\"}]", 
        got:"[Labels:{alertname=\"httpErrorCode4xxOnServiceOver20percent\", app_name=\"query-api\", priority=\"P3\", team=\"solutions\"} Annotations:{description=\"Application  query-api has too many HTTP errors in 4xx: (25.68%).\", summary=\"query-api: more than 20% HTTP 4xx codes: 25.68%\"}]"
Test 7 2019-02-20 16:59:08.818668 : 
  FAILED:
    alertname:httpErrorCode4xxOnServiceOver20percent, time:14m0s, 
        exp:"[Labels:{alertname=\"httpErrorCode4xxOnServiceOver20percent\", app_name=\"query-api\", priority=\"P3\", team=\"solutions\"} Annotations:{description=\"Application  query-api has too many HTTP errors in 4xx: (26.67%).\", summary=\"query-api: more than 20% HTTP 4xx codes: 26.67%\"}]", 
        got:"[Labels:{alertname=\"httpErrorCode4xxOnServiceOver20percent\", app_name=\"query-api\", priority=\"P3\", team=\"solutions\"} Annotations:{description=\"Application  query-api has too many HTTP errors in 4xx: (25.68%).\", summary=\"query-api: more than 20% HTTP 4xx codes: 25.68%\"}]"
Test 8 2019-02-20 16:59:08.905431 : 
  FAILED:
    alertname:httpErrorCode4xxOnServiceOver20percent, time:14m0s, 
        exp:"[Labels:{alertname=\"httpErrorCode4xxOnServiceOver20percent\", app_name=\"query-api\", priority=\"P3\", team=\"solutions\"} Annotations:{description=\"Application  query-api has too many HTTP errors in 4xx: (26.67%).\", summary=\"query-api: more than 20% HTTP 4xx codes: 26.67%\"}]", 
        got:"[Labels:{alertname=\"httpErrorCode4xxOnServiceOver20percent\", app_name=\"query-api\", priority=\"P3\", team=\"solutions\"} Annotations:{description=\"Application  query-api has too many HTTP errors in 4xx: (25.68%).\", summary=\"query-api: more than 20% HTTP 4xx codes: 25.68%\"}]"
Test 9 2019-02-20 16:59:08.997226 : 
  FAILED:
    alertname:httpErrorCode4xxOnServiceOver20percent, time:14m0s, 
        exp:"[Labels:{alertname=\"httpErrorCode4xxOnServiceOver20percent\", app_name=\"query-api\", priority=\"P3\", team=\"solutions\"} Annotations:{description=\"Application  query-api has too many HTTP errors in 4xx: (26.67%).\", summary=\"query-api: more than 20% HTTP 4xx codes: 26.67%\"}]", 
        got:"[Labels:{alertname=\"httpErrorCode4xxOnServiceOver20percent\", app_name=\"query-api\", priority=\"P3\", team=\"solutions\"} Annotations:{description=\"Application  query-api has too many HTTP errors in 4xx: (25.68%).\", summary=\"query-api: more than 20% HTTP 4xx codes: 25.68%\"}]"
Test 10 2019-02-20 16:59:09.115114 : 

Finally, this random behaviour disappears when we put all the rules (records and alerts) in the same file.

If we put the records first, at the top of the file, followed by the alerts, we always get SUCCESS.
If we put the alerts at the beginning of the file, followed by the records, we always get FAILED, with the exact same value of 25.68 instead of 26.67, or the alert not firing because it is still pending.

So we reproduce the same two ways of calculating things, which is still strange, but at least deterministically.

Environment

  • System information:

Running on macOS (the same thing is reproduced on CentOS): Darwin 17.7.0 x86_64

  • Prometheus version:
    ./promtool --version
promtool, version 2.6.0 (branch: master, revision: 43c9d9e91fb7eb875634356661e4ea10ae39facb)
  build date:       20190214-06:25:51
  go version:       go1.11

brian-brazil commented Feb 20, 2019

With dependent rule groups, you want to use `group_eval_order`.

codesome commented Feb 20, 2019

As @brian-brazil mentioned, can you try fixing the order of the group evaluation and see if it still fails randomly? You can read about it in the docs here: https://prometheus.io/docs/prometheus/latest/configuration/unit_testing_rules/

zonArt commented Feb 20, 2019

@brian-brazil Much better, thanks for the tip

waberc commented Feb 20, 2019

Indeed, adding this fixed the bug. Thanks for pointing me to the right doc page:

rule_files:
  # These files are resolved relative to where you execute the test (not relative to the current file).
  - alerts.yml
  - records.yml

group_eval_order:
  - records.yml
  - alerts.yml

This works deterministically.
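
One detail worth noting from the unit-testing docs: group_eval_order lists rule group names (the name: fields inside groups:), not file names; it works here only because our groups happen to be named after their files. With distinct group names it would look roughly like this (the names below are just illustrative):

# records.yml
groups:
  - name: recording-rules    # illustrative group name
    rules: []                # recording rules omitted

# alerts.yml
groups:
  - name: alerting-rules     # illustrative group name
    rules: []                # alert rules omitted

# test file
group_eval_order:
  - recording-rules
  - alerting-rules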

waberc commented Feb 20, 2019

Thanks a lot for the fast response @brian-brazil @codesome

waberc closed this Feb 20, 2019
