All data before the first scrape is lost when using rate / irate / increase #3886
Comments
There are two things here:
Fundamentally, if you want exact results, metrics are not the approach to take; you should use logs instead. We could make counter handling ever more intricate, but it doesn't change that fundamental fact. I'd suggest looking at https://www.robustperception.io/existential-issues-with-metrics/
We could, but then we'd need to subtract the same adjustment from the end-of-lifecycle logic, so you'd not gain anything. The current rate has been experimentally verified to produce the correct result given the whole lifecycle of a target.
How fast are those churning? If targets only exist for a few scrape intervals, you are going to have a bad time.
Rate's extrapolation already takes this into account.
I do not see why any adjustment would need to be subtracted from the end of a lifecycle. We have 2-20 pods running, with a few living for a long time and others for just 10 minutes. It depends a lot on the load, but on average pods seem to live about 20-60 minutes.
We already know that our algorithm is correct on average. So if we were to increase the result at the start of the lifecycle without decreasing it at the end, we would overestimate.
That's what rate already does.
That should be okay, though if they're being created a lot it's possible that it's taking service discovery a while to return them.
So... the algorithm extrapolates to determine the rate of change before the counter was first scraped, and extrapolates to determine the rate of change after the counter was last scraped. And what it comes up with is correct on average. To fix the above issue, we would be replacing the extrapolation for the start period with the actual value that it increased by. The extrapolation at the end would remain unchanged.
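For anyone trying to follow the discussion, here is a rough sketch of the extrapolation being debated, paraphrased from the extrapolatedRate helper in promql/functions.go. It is simplified (counter-reset correction and the cap that stops a counter being extrapolated below zero are omitted) and the names are mine, so treat it as an illustration rather than the actual implementation:

```go
package main

// Simplified sketch of Prometheus's extrapolated rate/increase calculation.
// Not the real code: counter-reset handling and the "don't extrapolate a
// counter below zero" cap are left out.

type sample struct {
	t float64 // timestamp in seconds
	v float64 // counter value
}

func extrapolatedIncrease(samples []sample, rangeStart, rangeEnd float64) float64 {
	first, last := samples[0], samples[len(samples)-1]

	// Raw increase between the first and last sample inside the range.
	increase := last.v - first.v

	sampledInterval := last.t - first.t
	avgInterval := sampledInterval / float64(len(samples)-1)

	// Extrapolate the increase out to the range boundaries, but if the first
	// (or last) sample sits more than ~1.1 average intervals away from the
	// boundary, only extrapolate by half an average interval instead.
	extrapolateTo := sampledInterval
	if durationToStart := first.t - rangeStart; durationToStart < avgInterval*1.1 {
		extrapolateTo += durationToStart
	} else {
		extrapolateTo += avgInterval / 2
	}
	if durationToEnd := rangeEnd - last.t; durationToEnd < avgInterval*1.1 {
		extrapolateTo += durationToEnd
	} else {
		extrapolateTo += avgInterval / 2
	}

	return increase * (extrapolateTo / sampledInterval)
}

func main() {}
```

The relevant point for this thread is that the increase is always anchored at the first sample's value; whatever the counter accumulated before that sample is only ever approximated by the boundary extrapolation, never recovered exactly.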
On 25 Feb 22:42, Chris Duncan wrote:
@roidelapluie What is the use case for counter metrics not starting at zero?

Easiest one: Prometheus scraping fails for 30 minutes, then works again. Then you do not know what happened during those 30 minutes and cannot make any assumptions.
If a pod starts up, and Prometheus fails to scrape for 30 minutes, and then it finally scrapes a value of 5000 for the counter, then we know that the counter went up by 5000 over that time period. No assumptions needed.
I do not agree. The counter may have been reset x times during the period.
How or why would a counter reset? My understanding was that a counter resets because the process stopped/died and a new process started. Let's say you have a process start up, it counts up to 1000, then resets back to zero (or dies and restarts, or whatever), then it counts up to 9000, then resets back to zero again, and then counts up to 5000 and finally gets scraped for the first time.
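To make the disagreement concrete, here is my own illustration (not Prometheus source code) of how counter resets are detected from scraped samples alone: a reset can only be inferred when a scraped value is lower than the previous scraped value, so resets that happen entirely before the first scrape are invisible.

```go
package main

import "fmt"

// increaseWithResetCorrection mimics the general idea of counter handling:
// whenever a scraped value drops below the previous scraped value, assume the
// counter reset to zero and add the previous value back in.
func increaseWithResetCorrection(scraped []float64) float64 {
	correction := 0.0
	for i := 1; i < len(scraped); i++ {
		if scraped[i] < scraped[i-1] {
			correction += scraped[i-1]
		}
	}
	return scraped[len(scraped)-1] - scraped[0] + correction
}

func main() {
	// In the scenario described above, the counter really went
	// 0 -> 1000 -> reset -> 9000 -> reset -> 5000 before it was ever scraped.
	// The first sample Prometheus has is simply 5000, so none of those resets
	// (or the increments before them) can be recovered from the samples.
	//
	// A reset that is visible in the samples (5200 -> 100) is corrected for:
	// this prints 300 (200 before the reset plus 100 after it).
	fmt.Println(increaseWithResetCorrection([]float64{5000, 5200, 100}))
}
```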
Even simpler, you don't know when that increase of 5000 happened. It could have been in the last 30 minutes, it could have been last week.
This is why I suggested putting a timestamp of when the counter started in the metrics endpoint that clients expose.
Now we're going in circles. This is all ground which has been trodden before. Prometheus is not suitable if you care about every single event; you want logging for that. Counters that are slowly and irregularly incremented are also more susceptible to artifacts; if you had more traffic you wouldn't see this issue. It is possible your autoscaling system is having oscillation issues, which would exacerbate this by increasing churn, so I'd look into dampening it.
brian-brazil closed this Feb 26, 2018
Can you please say why my proposed solution would not work? I believe you are closing this out of hand and way too early.
I have already gone over this above. While this may all be new to you, I have already spent over a day of my time this month explaining to various users why their suggested improvements to rate are not as easy to fix as they may seem. I have many things to work on, so I'm afraid I can't give everyone a detailed individualised answer every time the same questions come up again and again. https://www.youtube.com/watch?v=67Ulrq6DxwA explains the reasons behind what we currently do.
JensRantil added a commit to tink-ab/opsgenie-prometheus-exporter that referenced this issue Mar 30, 2018
lock bot commented Mar 22, 2019
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.


veqryn commented Feb 25, 2018 (edited)
What did you do?
Used rate / irate / increase to see how much things have increased. In this case, we had an overall increase of 15 in our count.
What did you expect to see?
I expected to see spikes adding up to 15.
What did you see instead?
Spikes adding up to 11.
Environment
Server: Prometheus 2.0.0, Revision 0a74f98
Client: Prometheus Golang Client library (official)
(all our metrics are initialized and published at their zero-state, as per how the golang client library works)
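For reference, this is roughly what "initialized and published at their zero-state" means with the official Go client (github.com/prometheus/client_golang); the metric name is made up for illustration:

```go
package main

import (
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// Hypothetical counter; registered (and therefore exposed as 0) as soon as
// the process starts, before any emails are processed.
var emailsProcessed = prometheus.NewCounter(prometheus.CounterOpts{
	Name: "emails_processed_total",
	Help: "Total number of emails processed by this pod.",
})

func main() {
	prometheus.MustRegister(emailsProcessed)

	http.Handle("/metrics", promhttp.Handler())
	go func() { _ = http.ListenAndServe(":8080", nil) }()

	// If this runs before Prometheus's first scrape of the pod, the first
	// sample of the series is already 1 -- which is exactly the situation
	// described in this report.
	emailsProcessed.Inc()

	select {}
}
```

So a pod that happens to be scraped before doing any work does show 0; the problem described below is only that the first scrape often arrives after some work has already been counted.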
Summary
I built a bunch of graphs and dashboards, but they just didn't seem to be showing accurate data. After investigating and comparing with our logs, I confirmed that all of my graphs and dashboards show data that doesn't match what is actually happening in my system.
The problem appears to be that when using rate / irate / increase to see the rate at which metrics increase, any increase that happens before the first scrape is completely ignored/dropped by Prometheus.
We have an auto-scaling Kubernetes cluster. New pods are constantly being created, and old ones are constantly being killed off, or sometimes getting restarted.
Our scrape interval for everything in our cluster is 60 seconds.
Because we auto-scale up and down quite quickly, and because we scale up when we are the busiest, I estimate we lose about 20-40% of our data (not data points, but the actual values; so for example I would expect spikes adding up to 5000 over a 12-hour period, yet Prometheus shows spikes adding up to 3000, because the largest spikes are all missing).
For example, here we show 4 new pods starting up and processing 15 emails in total:

But the spikes add up to only 11, because we are missing the first 4 data points:

When you don't stack the first graph, you can see that the problem is that the 4 pods all managed to process 1 email before the first scrape, so the metric came in at 1 for each of them, instead of starting at zero:

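Put as arithmetic: each of the 4 pods had already counted 1 email when its first sample was taken, so increase() treats 1 as the baseline for each series and those 4 emails never show up in any spike; 15 - 4 = 11, which matches the second graph.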
I don't think it is reasonable to ask our processes/pods to start up and then sleep for 60 seconds to ensure that a scrape takes place before any work gets done.
I do not know what the best solution is, but I hope this issue will get addressed, since it makes the rate/irate/increase functions misleading at best, and outright wrong at worst.
One potential idea would be to have the client libraries publish a timestamp of when the prometheus library was initialized or when each metric got registered. Those timestamps would never change for the life of the process/container, so they would only need to be read in and parsed by the prometheus server during the first successful scrape. The server could then see that the metric actually started at zero, at that timestamp.
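Purely as an illustration of that idea (this is not an existing client_golang feature, and the series names are made up), the client side could look something like this, with the /metrics handler wired up as in any normal exporter:

```go
package main

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

var (
	emailsProcessed = prometheus.NewCounter(prometheus.CounterOpts{
		Name: "emails_processed_total",
		Help: "Total number of emails processed by this pod.",
	})
	// Hypothetical companion series carrying the proposed timestamp.
	emailsProcessedCreated = prometheus.NewGauge(prometheus.GaugeOpts{
		Name: "emails_processed_created",
		Help: "Unix time at which emails_processed_total was registered (its value was 0 at that instant).",
	})
)

func init() {
	prometheus.MustRegister(emailsProcessed, emailsProcessedCreated)
	emailsProcessedCreated.Set(float64(time.Now().Unix()))
}

func main() {} // /metrics serving omitted; same as in the earlier sketch
```

A server that understood this convention could then treat emails_processed_total as having been 0 at the emails_processed_created timestamp. Note that the Go client's default process collector already exposes process_start_time_seconds on supported platforms, which gives a per-process (though not per-metric) version of the same information.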