Summation received with sum_over_time is different from actual values #1876

Closed
rahulghanate opened this Issue Aug 8, 2016 · 10 comments

rahulghanate commented Aug 8, 2016

What did you do?
When I query for a metric without the sum_over_time function, the value is shown correctly, whereas the same query with sum_over_time gives a different result.
However, for some other time periods the same metric shows the correct value even with sum_over_time.
I am pushing values to Prometheus with the help of a Pushgateway server, and all values are getting pushed correctly.

What did you expect to see?
The value returned by the sum_over_time function should correspond to the values stored at the individual timestamps.

What did you see instead? Under which circumstances?
The value returned by sum_over_time is wrong for some samples (about 20 percent of all values in a day).
The query results below are for one period, but the same behaviour is seen for many timestamps during the day.

Environment
Test environment

  • System information:
    Linux 3.2.0-80-generic x86_64
  • Prometheus version:
# /var/lib/Prometheus/server/prometheus -version
prometheus, version 0.20.0 (branch: , revision: )
  build user:       root@e2b6843bdcf6
  build date:       20160628-14:23:13
  go version:       go1.5.3
  • Alertmanager version:
# /var/lib/Prometheus/alertManager/alertmanager -version
alertmanager, version 0.2.1 (branch: master, revision: 3f1c996)
  build user:       root@e2b6843bdcf6
  build date:       20160623-09:19:12
  go version:       go1.6.2

  • Prometheus configuration file:
global:
  scrape_interval: "5m"
  scrape_timeout: "20s"
  evaluation_interval: "5m"
rule_files:
  - './prometheus.rules'

scrape_configs:
  - job_name: "PerfData"
    scrape_interval: "30s"
    scrape_timeout:  "20s"
    target_groups:
    - targets: ['localhost:46004']
Query without sum_over_time

{
  "status": "success",
  "data": {
    "resultType": "matrix",
    "result": [
      {
        "metric": {
          "label_1": "2344461436",
          "label_2": "288364153",
          "label_3": "10"
        },
        "values": [
          [
            1470564000.781,
            "9"
          ]
        ]
      },
      {
        "metric": {
          "label_1": "2344461436",
          "label_2": "288364153",
          "label_3": "11"
        },
        "values": [
          [
            1470567600.781,
            "8"
          ]
        ]
      }
    ]
  }
}

Query with sum_over_time

http://prometheus.mytest.com/api/v1/query_range?query=sum by (label_1, label_2,label_3) (sum_over_time(metric1{label_1 ="2344461436",label_2 = "288364153", job="PerfData"}[1h]))&start=2016-08-07T10:00:00.781Z&end=2016-08-07T11:05:00.781Z&step=1h

{
  "status": "success",
  "data": {
    "resultType": "matrix",
    "result": [
      {
        "metric": {
          "label_1": "2344461436",
          "label_2": "288364153",
          "label_3": "10"
        },
        "values": [
          [
            1470564000.781,
            "9"
          ]
        ]
      },
      {
        "metric": {
          "label_1": "2344461436",
          "label_2": "288364153",
          "label_3": "11"
        },
        "values": [
          [
            1470567600.781,
            "17"
          ]
        ]
      }
    ]
  }
}

Note that the second entry above comes back as 17 (it should be 8), while the first value (9) in the vector is the same in both query results.

Another strange behaviour is seen when I reduce the step size to the minimum.

I have multiple values spread over the first 5 minutes of the hour, but sum_over_time still results in a single value for the 10th hour, whereas it gives a different value for the 11th hour.

Query without sum_over_time by smaller steps

http://prometheus.mytest.com/api/v1/query_range?query=sum by (label_1, label_2,label_3) (metric1{label_1 ="2344461436",label_2 = "288364153", job="PerfData"})&start=2016-08-07T10:00:00.781Z&end=2016-08-07T11:05:00.781Z&step=1m

{
  "status": "success",
  "data": {
    "resultType": "matrix",
    "result": [
      {
        "metric": {
          "label_1": "2344461436",
          "label_2": "288364153",
          "label_3": "10"
        },
        "values": [
          [
            1470564000.781,
            "9"
          ],
          [
            1470564060.781,
            "9"
          ],
          [
            1470564120.781,
            "9"
          ],
          [
            1470564180.781,
            "9"
          ],
          [
            1470564240.781,
            "9"
          ]
        ]
      },
      {
        "metric": {
          "label_1": "2344461436",
          "label_2": "288364153",
          "label_3": "11"
        },
        "values": [
          [
            1470567600.781,
            "8"
          ],
          [
            1470567660.781,
            "8"
          ],
          [
            1470567720.781,
            "8"
          ],
          [
            1470567780.781,
            "8"
          ],
          [
            1470567840.781,
            "8"
          ]
        ]
      }
    ]
  }
}

brian-brazil commented Aug 8, 2016

Can you share the raw data that you're working off? I see nothing implausible here based on what you've provided so far.

rahulghanate commented Aug 9, 2016

Thanks for the reply, Brian.
Could you please give me details of what raw data is needed for further debugging and how to retrieve it?
I retrieved the above data with the REST API. Do you want me to run those queries without filters and aggregation functions?
Some sample queries or links would really help me provide more debugging data.

brian-brazil commented Aug 9, 2016

I need basically query?query=metric1[1h] for the time series and time range in question.
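
(For example, something along the lines of http://prometheus.mytest.com/api/v1/query?time=2016-08-07T10:10:00Z&query=metric1[1h], reusing the host and time range from the queries above; the exact parameters are only illustrative.)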

rahulghanate commented Aug 9, 2016

Your question made me curious to check the values without the aggregation function.
I found out that the query I am calling, sum_over_time(metric1{....}[1h]), covers the time range 10:00:00 to 11:00:00, and the next hourly iteration for the same metric, covering 11:00:00 to 12:00:00, also includes the very first second of that range.

So a sample is counted in both ranges at every intersection in time.
In my opinion the _over_time functions with a range vector ([1h]/[1m]/[1d]) should not also count the sample that falls exactly at the start of the next hour/minute/day.
This results in a different value whenever the range switches.
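
(For illustration, assuming the closed ranges described here: a sample stamped exactly 11:00:00 would be selected both by metric1[1h] evaluated at 11:00:00, whose window is 10:00:00 to 11:00:00, and by metric1[1h] evaluated at 12:00:00, whose window is 11:00:00 to 12:00:00, so it would contribute to the sums of two consecutive hours.)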

Below are the raw values, as per the requested query, for two different evaluation times (10:00 and 11:00).

http://prometheus.mytest.com/api/v1/query?time=2016-08-07T10:10:00Z&query=(metric1{label_1%20=%222344461436%22,label_2%20=%20%22288364153%22,job=%22PerfData%22}[1h])

{
    "status": "success",
    "data": {
        "resultType": "matrix",
        "result": [
            {
                "metric": {
                    "__name__": "metric1",
                    "label_1": "2344461436",
                    "label_2": "288364153",
                    "exported_job": "PerfData",
                    "label_3": "10",
                    "instance": "localhost:46004",
                    "job": "PerfData",
                },
                "values": [
                    [
                        1470564000,
                        "9"
                    ]
                ]
            }
        ]
    }
}

http://prometheus.mytest.com/api/v1/query?time=2016-08-07T11:10:00Z&query=(metric1{label_1%20=%222344461436%22,label_2%20=%20%22288364153%22,job=%22PerfData%22}[1h])

{
    "status": "success",
    "data": {
        "resultType": "matrix",
        "result": [
            {
                "metric": {
                    "__name__": "metric1",
                    "label_1": "2344461436",
                    "label_2": "288364153",
                    "exported_job": "PerfData",
                    "label_3": "11",
                    "instance": "localhost:46004",
                    "job": "PerfData",
                },
                "values": [
                    [
                        1470567600,
                        "8"
                    ]
                ]
            }
        ]
    }
}

Below is an image of the graph panel with the max_over_time function, showing the spike at the range intersection:
[screenshot from 2016-08-09 13-23-34]

brian-brazil commented Aug 9, 2016

You're not allowing slack in your ranges for oddness, so this sort of thing will happen.

What exactly are you monitoring? Such an exact timestamp is unnatural.

rahulghanate commented Aug 9, 2016

I am using the Prometheus query language and alerting on data values from a database, so I am pushing records from the DB to Prometheus through the Pushgateway.
The records are populated in the DB hourly and are then pushed to the Pushgateway for querying in Prometheus.
I am assigning the timestamp in the Pushgateway as XX:00:00, so all values have a timestamp of XX:00:00.000.

As there are some filters associated with it, I am using the sum_over_time function to get the summation over an hour, which somehow also includes the sample at the start of the next range.

brian-brazil commented Aug 9, 2016

These are the semantics of the language; this is not the way you're meant to ingest this sort of data.

It sounds like you should be using direct instrumentation and a counter instead.
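
(For illustration, here is a minimal sketch of what direct instrumentation with a counter could look like, using the Go client library. The metric name records_processed_total, the port and the hard-coded label values are made up for this example and are not taken from the issue.)

package main

import (
    "log"
    "net/http"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promhttp"
)

// recordsTotal counts records as they are processed, instead of pushing
// a pre-aggregated hourly value with a fixed timestamp.
var recordsTotal = prometheus.NewCounterVec(
    prometheus.CounterOpts{
        Name: "records_processed_total", // hypothetical metric name
        Help: "Total number of records processed.",
    },
    []string{"label_1", "label_2"},
)

func main() {
    prometheus.MustRegister(recordsTotal)

    // Wherever a batch of records is handled, add its size to the counter.
    recordsTotal.WithLabelValues("2344461436", "288364153").Add(9)

    // Expose /metrics for Prometheus to scrape.
    http.Handle("/metrics", promhttp.Handler())
    log.Fatal(http.ListenAndServe(":46004", nil))
}

The hourly total could then be queried with something like increase(records_processed_total[1h]), which does not depend on samples landing exactly on hour boundaries.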

rahulghanate commented Aug 9, 2016

Thanks for the suggestion.
I will figure out how I can convert this to an incrementing counter.
But I still feel that the aggregation period should not include the sample at the very end of the range.

You can close the issue if you feel that's not the idea behind range vectors.

brian-brazil commented Aug 9, 2016

We've discussed it before, and inclusive ranges are the right behaviour. In any case we'd not be able to consider changing this until 2.0.

lock bot commented Mar 24, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

lock bot locked and limited conversation to collaborators Mar 24, 2019
