All data before the first scrape is lost when using rate / irate / increase #3886

Closed
veqryn opened this Issue Feb 25, 2018 · 19 comments

veqryn commented Feb 25, 2018

What did you do?
Used rate / irate / increase to see how much things have increased. In this case, we had an overall increase in our count by 15.

What did you expect to see?
I expected to see spikes adding up to 15.

What did you see instead?
Spikes adding up to 11.

Environment
Server: Prometheus 2.0.0, Revision 0a74f98
Client: Prometheus Golang Client library (official)
(all our metrics are initialized and published at their zero-state, as per how the golang client library works)

Summary
I built a bunch of graphs and dashboards, but they just didn't seem to be showing accurate data. After investigating and comparing with our logs, I confirmed that all of my graphs and dashboards show data that doesn't match what is actually happening in my system.

The problem appears to be that when using rate / irate / increase to see how quickly metrics increase, any increase that happens before the first scrape is completely ignored/dropped by Prometheus.

We have an auto-scaling Kubernetes cluster. New pods are constantly being created, and old ones are constantly being killed off or occasionally restarted.
Our scrape interval for everything in our cluster is 60 seconds.

Because we auto-scale up and down quite quickly, and because we scale up when we are the busiest, I estimate we lose about 20-40% of our data (not data points, but the actual values; for example, I would expect spikes adding up to 5000 over a 12-hour period, yet Prometheus shows spikes adding up to 3000, because the largest spikes are all missing).

For example, here we show 4 new pods starting up and processing 15 emails in total:
[graph screenshot]

But the spikes add up to only 11, because we are missing the first 4 data points:
[graph screenshot]

When you don't stack the first graph, you can see that the problem is that the 4 pods each managed to process 1 email before the first scrape, so the metric came in at 1 for each of them instead of starting at zero:
[graph screenshot]

I don't think it is reasonable to ask our processes/pods to start up and then sleep for 60 seconds to ensure that a scrape takes place before any work gets done.

I do not know what the best solution is, but I hope this issue will get addressed, since it makes the rate/irate/increase functions misleading at best, and outright wrong at worst.

One potential idea would be to have the client libraries publish a timestamp of when the Prometheus library was initialized, or when each metric got registered. Those timestamps would never change for the life of the process/container, so they would only need to be read in and parsed by the Prometheus server during the first successful scrape. The server could then see that the metric actually started at zero, at that timestamp.
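As a rough illustration of the idea (just a sketch of one way an application could approximate it today with the Go client; the metric names here are made up and this is not an existing client-library feature), the registration time could be exported as a gauge alongside the counters:

```go
package main

import (
	"net/http"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

var (
	// A counter we care about; it is exposed at 0 as soon as it is registered.
	emailsProcessed = promauto.NewCounter(prometheus.CounterOpts{
		Name: "emails_processed_total",
		Help: "Emails processed by this pod.",
	})

	// Hypothetical companion metric: the Unix time at which this process
	// registered its metrics. It never changes for the life of the process,
	// so the server would only need to read it once.
	metricsCreated = promauto.NewGauge(prometheus.GaugeOpts{
		Name: "app_metrics_created_timestamp_seconds",
		Help: "Unix time at which this process registered its metrics.",
	})
)

func main() {
	metricsCreated.Set(float64(time.Now().Unix()))
	http.Handle("/metrics", promhttp.Handler())
	http.ListenAndServe(":8080", nil)
}
```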

brian-brazil (Member) commented Feb 25, 2018

rate() already accounts for this with its extrapolation at the start and end of a series; it will produce the correct answer on average. The exact same happens when an instance dies, and there's no magic solution there.

roidelapluie (Contributor) commented Feb 25, 2018

There are a few things here:

  1. Prometheus cannot assume that counters start at 0. A Prometheus server could start scraping a target that has been running for a while, so the functions will always miss the first point.

  2. Best practices say that you should, if possible, initialize counters at 0 (see the sketch after this list). In that case you have a better chance of avoiding the problem.

  3. You could also use `increase(foo) or foo` as a query, but you might end up with strange results.
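For point 2, a minimal sketch of what initializing counters at 0 can look like with the official Go client (metric and label names are illustrative); labelled counters in particular only expose a child once that label combination has been touched:

```go
package metrics

import (
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

// A plain counter is exposed at 0 as soon as it is registered.
var jobsTotal = promauto.NewCounter(prometheus.CounterOpts{
	Name: "jobs_processed_total",
	Help: "Jobs processed by this process.",
})

// A labelled counter only exposes a time series once that label combination
// has been used, so known combinations are touched at startup to publish
// them at 0.
var requestsTotal = promauto.NewCounterVec(prometheus.CounterOpts{
	Name: "requests_handled_total",
	Help: "Requests handled, by outcome.",
}, []string{"outcome"})

func init() {
	for _, outcome := range []string{"success", "failure"} {
		requestsTotal.WithLabelValues(outcome) // creates the child at 0
	}
}
```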

brian-brazil (Member) commented Feb 25, 2018

Fundamentally, if you want exact results, metrics are not the approach to take; you should use logs instead. We could make counter handling ever more intricate, but it doesn't change the fact that rate() will always be an estimate. For every issue at the start of a counter's life, there is an equivalent issue at the end of its life.

I'd suggest looking at https://www.robustperception.io/existential-issues-with-metrics/

veqryn (Author) commented Feb 26, 2018

@roidelapluie

What is the use case for counter metrics not starting at zero?
I thought this was a very safe assumption, and it is what every single client library does. All metrics start at zero when they are first created/registered. Prometheus may not scrape them until after the target has been running for a bit, but that doesn't invalidate that they started at zero. If there are no good use cases, I think it would be very safe and valid to acknowledge that all counters start at zero (on the client side at least) and make a rule about it.

@brian-brazil

rate shouldn't need to extrapolate for the start, because we know that all counters start at zero. (There may be questions around when the counter started, which is why my idea suggests a timestamp, but it wouldn't be the only way to solve this problem.)

I don't think anyone is going to argue about the end of a metric. If a process dies, there is no way to know what it did between the last scrape and the time it died, and I fully acknowledge that.

But there are things that can be done about the start of counter metrics, so it would be great to handle that better. It is doable, and it doesn't need to make extrapolations because the data (that it started at zero) does exist.

Here are some further examples from production:

The upper graph shows a process that is retrieving messages. It is being autoscaled (kube pods), with lots of new pods being created during the time span below. Each message it retrieves is put on a queue.
The lower graph shows a process that is pulling messages off that queue a second or two after they are put on it. It is NOT being autoscaled (yet), and had a stable number of pods during the time span.
When deep-diving into the logs, we can show that the two numbers should be equal for this time range.
Yet, because of the issue described above, the rate function is returning values that add up to about 30% less than they should on average. It is also missing a lot of spikes that should exist.
[graph screenshot]

Here are the same processes described above, but overlaid on top of each other so that the difference is more apparent.
The pods that are being autoscaled end up reporting numbers that are 40% off from reality, with the graph missing quite a few spikes at various points.
[graph screenshot]

brian-brazil (Member) commented Feb 26, 2018

But there are things that can be done about the start of counter metrics, so it would be great to handle that better.

We could, but then we'd need to subtract the same adjustment from the end-of-lifecycle logic, so you'd not gain anything. The current rate has been experimentally verified to produce the correct result given the whole lifecycle of a target.

The pods that are being autoscaled end up reporting numbers that are 40% off from reality, with the graph missing quite a few spikes at various points.

How fast are those churning? If targets only exist for a few scrape intervals, you are going to have a bad time.

brian-brazil (Member) commented Feb 26, 2018

rate shouldn't need to extrapolate for the start, because we know that all counters start at zero.

Rate's extrapolation already takes this into account.
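Roughly, the idea is the following (a simplified sketch for illustration, not the actual Prometheus source): the extrapolation toward the start of the range is capped at the point where the counter would have been zero, judging from the first sample's value and the observed increase.

```go
package sketch

// extrapolatedIncrease is a simplified sketch of an extrapolated increase()
// over a range [rangeStart, rangeEnd] (illustrative only; counter resets are
// assumed to have already been corrected for). ts holds sample timestamps in
// seconds and vals the corresponding counter values.
func extrapolatedIncrease(ts, vals []float64, rangeStart, rangeEnd float64) float64 {
	last := len(vals) - 1
	if last < 1 {
		return 0 // need at least two samples
	}
	rawIncrease := vals[last] - vals[0]
	sampledInterval := ts[last] - ts[0]
	avgStep := sampledInterval / float64(last)

	durationToStart := ts[0] - rangeStart
	durationToEnd := rangeEnd - ts[last]

	// If the first sample's value implies the counter hit zero inside the
	// range, only extrapolate back to that estimated starting point.
	if rawIncrease > 0 && vals[0] >= 0 {
		durationToZero := sampledInterval * (vals[0] / rawIncrease)
		if durationToZero < durationToStart {
			durationToStart = durationToZero
		}
	}

	// Extrapolate a limited distance past the first and last samples.
	threshold := avgStep * 1.1
	extrapolateTo := sampledInterval
	if durationToStart < threshold {
		extrapolateTo += durationToStart
	} else {
		extrapolateTo += avgStep / 2
	}
	if durationToEnd < threshold {
		extrapolateTo += durationToEnd
	} else {
		extrapolateTo += avgStep / 2
	}
	return rawIncrease * (extrapolateTo / sampledInterval)
}
```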

veqryn (Author) commented Feb 26, 2018

I do not see why any adjustment would need to be subtracted from the end of a lifecycle.
We know exactly that counters start at zero, regardless of their value at the first scrape. We do not know exactly when a process dies. Depending on what the rate of change was close to the last scrape, you can extrapolate what the likely final value would have been on average (i.e. recent rate of change per second * half the scrape interval, or whatever). This doesn't affect or change the fact that counters start at zero, and accounting for them starting at zero shouldn't cause Prometheus' rate function to subtract anything from the end of a series.
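For instance (illustrative numbers only): if a counter was increasing by roughly 0.5/second near its last scrape and the scrape interval is 60 seconds, the unseen increase after the last sample would be estimated at about 0.5 * 30 = 15.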

We have 2-20 pods running, with a few living for a long time and others just 10 minutes. It is pretty dependent on the load, but on average, pods seem to live about 20-60 minutes.

brian-brazil (Member) commented Feb 26, 2018

I do not see why any adjustment would need to be subtracted from the end of a lifecycle.

We already know that our algorithm is correct on average. So if we were to increase the result at the start of the lifecycle without decreasing it at the end we would overestimate.

Depending on what the rate of change was close to the last scrape, you can extrapolate what the likely final value would have been on average

That's what rate already does.

We have 2-20 pods running, with a few living for a long time and others just 10 minutes.

That should be okay, though if they're being created a lot it's possible that it's taking service discovery a while to return them.

veqryn (Author) commented Feb 26, 2018

So... the algorithm extrapolates to determine the rate of change before the counter was first scraped, and extrapolates to determine the rate of change after the counter was last scraped. And what it comes up with is correct on average.

To fix the above issue, we would replace the extrapolation for the start period with the actual value that the counter increased by. The extrapolation at the end would remain unchanged.
If the algorithm is correct on average, then there should be no change on average from replacing the extrapolated value with the actual value at the start. (If the algorithm is actually correct on average, then using the actual values wouldn't be an increase.)

roidelapluie (Contributor) commented Feb 26, 2018

veqryn (Author) commented Feb 26, 2018

If a pod starts up, and Prometheus fails to scrape it for 30 minutes, and then it finally scrapes a value of 5000 for the counter, then we know that the counter went up by 5000 over that time period. No assumptions needed.
You may not know whether it went up evenly over that time period, or all in a single spike, but you do know that it went up by 5000. And since rate is already making lots of extrapolations and averages, I would certainly prefer it showed in some way that the counter went up by 5000, rather than just pretending that absolutely nothing happened.

roidelapluie (Contributor) commented Feb 26, 2018

I do not agree. The counter may have been reset X times during the period.

veqryn (Author) commented Feb 26, 2018

How or why would a counter reset? My understanding was that a counter resets because the process stopped/died and a new process started.
In any case, I don't see how a counter reset in this situation would be any different than having a pod that never got scraped at all, which then died.

Let's say you have a process start up, it counts up to 1000, then resets back to zero (or dies and restarts, or whatever), then it counts up to 9000, then resets back to zero again, and then counts up to 5000 and finally gets scraped for the first time.
Prometheus at this point does not know how many times it has been reset (or how many pods out there existed and died before ever getting scraped).
But it does know that we counted from at least 0 to 5000.
Given this situation, Prometheus should show that increase to 5000 somehow. The idea that we would rather show absolutely nothing instead of 5000 does not make any sense to me.

brian-brazil (Member) commented Feb 26, 2018

Even simpler, you don't know when that increase of 5000 happened. It could have been in the last 30 minutes, it could have been last week.

veqryn (Author) commented Feb 26, 2018

This is why I suggested putting a timestamp of when the counter started in the metrics endpoint that clients expose.
There are other possible solutions though, as that was just one idea.

brian-brazil (Member) commented Feb 26, 2018

Now we're going in circles.

This is all ground which has been trod before. Prometheus is not suitable if you care about every single event; you want logging for that. Counters that are slowly and irregularly incremented are also more susceptible to artifacts; if you had more traffic you wouldn't see this issue. It is possible your autoscaling system is having oscillation issues, which would exacerbate this by increasing churn, so I'd look into dampening it.

veqryn (Author) commented Feb 26, 2018

Can you please say why my proposed solution would not work?

I believe you are closing this out-of-hand and way too early.

brian-brazil (Member) commented Feb 26, 2018

I have already gone over this above.

While this may all be new to you, I have already spent over a day of my time this month explaining to various users why their suggested improvements to rate are not the easy fixes they may seem. I have many things to work on, so I'm afraid I can't give everyone a detailed individualised answer every time the same questions come up again and again. https://www.youtube.com/watch?v=67Ulrq6DxwA explains the reasons behind what we currently do.

JensRantil added a commit to tink-ab/opsgenie-prometheus-exporter that referenced this issue Mar 30, 2018

fix: allow scraper to scrape counter zero values
This is a hack to allow `increase` to work. See [1].

[1] prometheus/prometheus#3886 (comment)
lock bot commented Mar 22, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

lock bot locked and limited conversation to collaborators Mar 22, 2019
