
increase(...) appears to treat discontinuity as 0-value (leading to larger than expected increase) #3852

Closed
jamesblackburn opened this Issue Feb 16, 2018 · 11 comments


jamesblackburn commented Feb 16, 2018

What did you do?

Attempted to use increase to see the change in a value over a 24h period.

What did you expect to see?

In the data we see a large jump in increase that seems to correlate with a missed scrape of the service endpoint.

Observed data: [screenshots attached]

What did you see instead? Under which circumstances?

Large increase observed around the discontinuity.

[screenshots attached]

Environment

  • System information:

    jblackburn@dlonapahls91:~$ uname -srm
    Linux 3.10.0-693.2.2.el7.x86_64 x86_64

  • Prometheus version:

    [root@a52004ff3fe8 /]# /usr/bin/prometheus --version
    prometheus, version 2.1.0 (branch: HEAD, revision: 85f23d82a045d103ea7f3c89a91fba4a93e6367a)
      build user:       root@6e784304d3ff
      build date:       20180119-12:01:23
      go version:       go1.9.2

brian-brazil commented Feb 16, 2018

That looks like the effect of a counter going backwards. Can you share the raw data of one of those time series around that time?
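
For context on why a counter going backwards matters here: increase() and rate() treat any decrease between consecutive samples as a counter reset, assume the counter restarted from zero, and add the previous value back onto the delta. The Go sketch below (a hypothetical helper, not Prometheus's actual code, and without the boundary extrapolation Prometheus applies) illustrates just that reset handling:

// Simplified sketch of counter-reset detection in an increase()-style
// calculation. Any sample lower than its predecessor is treated as a
// reset, and the pre-reset value is added back onto the running delta.
package main

import "fmt"

func simpleIncrease(samples []float64) float64 {
    if len(samples) < 2 {
        return 0
    }
    delta := samples[len(samples)-1] - samples[0]
    prev := samples[0]
    for _, v := range samples[1:] {
        if v < prev {
            // Assumed counter reset: the counter is taken to have
            // restarted from zero, so add the pre-reset value back in.
            delta += prev
        }
        prev = v
    }
    return delta
}

func main() {
    // A counter that merely dips (100 -> 110 -> 108 -> 120) is read as
    // having reset, so the result is 20 + 110 = 130 rather than 20.
    fmt.Println(simpleIncrease([]float64{100, 110, 108, 120}))
}

So even a tiny dip in a large counter can inflate the computed increase by roughly the whole counter value.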


jamesblackburn commented Feb 16, 2018

Hi -

Taking one of the metrics:
Raw Data

message_count{instance=~"1.*", env="prod", service="RSF", partition="14"}


http://prometheus.dev.ahl:9090/api/v1/query_range?query=message_count%7Binstance%3D~%221.*%22%2C%20env%3D%22prod%22%2C%20service%3D%22RSF%22%2C%20partition%3D%2214%22%7D&start=1518020340&end=1518034740&step=57&_=1518780931424
query1data.txt

Increase

increase(message_count{instance=~"1.*", env="prod", service="RSF", partition="14"}[24h])

http://prometheus.dev.ahl:9090/api/v1/query_range?query=increase(message_count%7Binstance%3D~%221.*%22%2C%20env%3D%22prod%22%2C%20service%3D%22RSF%22%2C%20partition%3D%2214%22%7D%5B24h%5D)&start=1518020340&end=1518034740&step=57&_=1518780931425
increasedata.txt

There's a jump between these two timesteps in increase:

[1518034077,"32034316.86392156"],
[1518034134,"482297848.49101365"]

But there is no similar jump in the data.

Even if I query at 1s precision, I don't see anything untoward in the data:
query1s.txt

other than a discontinuity -

[1518026146,"445224004"],
[1518026147,"445224004"],
[1518026793,"445957884"],
[1518026794,"445957884"],

jamesblackburn commented Feb 16, 2018

There are two other things that may be interesting / relevant:

  • there is a counter reset ~27 hours later, which appears to be correctly handled by increase
  • there are other scrape gaps, which appear to be correctly handled

Screenshot attached showing both.


brian-brazil commented Feb 16, 2018

"There's a jump between these two timesteps in increase:"

Can you give me the raw data around those times?


jamesblackburn commented Feb 16, 2018

The raw data was in increasedata.txt:
https://github.com/prometheus/prometheus/files/1731161/increasedata.txt

and at 1s resolution query1s.txt:
https://github.com/prometheus/prometheus/files/1731173/query1s.txt

Let me know if there's any other data you would like (or some other way of getting the data).


brian-brazil commented Feb 16, 2018

The data in query1s.txt does not cover the time period of the issue in increasedata.txt.


jamesblackburn commented Feb 16, 2018

Ah, sorry about this - it does look like something strange has happened: the counter has briefly gone backwards by 163...

query1s up to 1518034134:
query1s-1518034134.txt

[1518034098,"450214079"],
[1518034099,"450216699"],
[1518034100,"450216536"],
...
[1518034112,"450216536"],
[1518034113,"450220374"]

increase1s up to 1518034134:
increase1s-1518034134.txt

[1518034099,"32024457.718303636"],
[1518034100,"482298795.2844846"]
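
That dip is enough to trigger reset handling: the pre-dip value of roughly 450M is added back onto the delta, which lines up with increase() jumping by about 450M between those two steps. As a rough check, plugging the reported samples into the same simplified logic as the earlier sketch (again omitting boundary extrapolation, so only approximate):

// The reported samples around the dip, run through simplified reset
// handling: the 163-count decrease is treated as a reset and the
// previous value (~450M) is added back in.
package main

import "fmt"

func main() {
    samples := []float64{450214079, 450216699, 450216536, 450220374}
    delta := samples[len(samples)-1] - samples[0] // 6295 without a reset
    for i := 1; i < len(samples); i++ {
        if samples[i] < samples[i-1] {
            delta += samples[i-1] // 450216536 < 450216699: assumed reset
        }
    }
    fmt.Println(delta) // ~4.5022e+08, in line with the ~450M jump in increase()
}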

jamesblackburn commented Feb 16, 2018

We're digging deeper into this - we've recently migrated from Prometheus 1.x to 2.x and it may be that there's some funny interaction between the old and the new with remote read.


jamesblackburn commented Feb 16, 2018

So we think this is caused by a discontinuity introduced by the migration from Prometheus 1.x to 2.x and remote read.

Sorry for the noise!


brian-brazil commented Feb 17, 2018

This is the first report of a discontinuity from such a setup. This implies that users should be careful to have scraping working for at least a scrape interval to mitigate this.


lock bot commented Mar 22, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

lock bot locked and limited conversation to collaborators Mar 22, 2019
