increase() should consider creation of new timeseries as reset #1673
Comments
I don't think this is workable. We can't distinguish between, for example, a freshly restarted Prometheus, a gap in the data relative to the range you're looking at, and a target that just came up and immediately incremented a counter.
The use case is clear, and we discussed this before. I still cannot see any different outcome. It is not only about "being scraped for the first time". It might also be a cut-off because of retention. Or a missed scrape. Or what not. How would that proposed prefix function work? When exactly would it prefix the time series? Based on the gap towards the beginning of the range? I think to solve the problem, we need a notion of "the time series begins here". That coincides with the solution to #398.
Brian won the race by a few seconds... ;)
I'm not sure #398 helps us here. That's about when time series stop rather than when they start.
I see the stop and start of a series as two sides of the same coin. I expect any sane solution for the one to be a solution for the other, too, which would then solve a whole class of related problems (like the rate extrapolation discussed elsewhere, or the detection of gaps for the purpose of backfilling; incidentally, it would also solve this use case).
It still wouldn't let us tell when an increase occurred; it could have been a month ago or a second ago. Staleness is intended to only affect how we deal with instant vectors; range vector semantics would remain unchanged.
What worries me is that there isn't any reasonable workaround for the problem. If you have an exporter that exports counter metrics from a 3rd-party system and has no notion of which metrics to expect (so that it could expose them with value 0 before a counter increase happens), there is no way to access the first counter increase.
I see how this is far from ideal, but something like that would solve our problem. But the start/stop of series seems like a better solution. If the elements in a range vector indicated whether they are the first value of the series, increase() (or a new op) could assume that the value before them was 0. But you know better how to implement that. I'm just saying that we have use cases, and without this being solved we need to build a one-off from scratch to do such alerting, or come up with ugly hacks. Unless of course you know some better workaround, or a simpler feature that could be used as a workaround. In the end it's about alerting.
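(For illustration: a minimal PromQL sketch of the kind of alert expression being asked for here; the metric name `errors_total` and the 10m window are assumptions, not taken from this thread.)

```promql
# Fire when the counter increased within the window, OR when the series just
# appeared with a non-zero value, which increase() alone does not report.
  increase(errors_total[10m]) > 0
or
  (errors_total unless errors_total offset 10m) > 0
```

The second clause selects series that exist now but did not exist 10 minutes ago; it is only an approximation and still cannot distinguish a genuinely new counter from one reappearing after a gap.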
Why not do
If you had a notion of the "true" start of the series (in contrast to "I just started to scrape this target" or "I just started the Prometheus server for the first time" or "the network just came back"), you could answer the question of the OP: "Did this error count of 1 just spring into existence, or was that error counter at one for quite some time and I just didn't notice for other reasons?"
That might have been the original intention. But we already realized it is a hard problem to solve. The longer I think about the problem, the more I realize that its solution is the solution to a whole class of problems. I expect that you cannot solve the problem of staleness in instant queries without also solving all those other problems.
@brian-brazil Because nobody will reset the error counter; I need the alert to recover on its own. And yes, I know I could also change the world and make whatever is erroring expose 0 if there is no error, and this would be the better setup from a Prometheus perspective, but it's simply not possible currently. It might change in the future, but there will always be similar use cases, which is a problem if you want to rely on Prometheus as the sole monitoring system.
The solution and semantics I've sketched out only cover instant selectors. As far as I can see, they give us no additional data that's useful for ranges, as the actual data points are all outside the range. Notification of staleness doesn't help much, as we're already estimating that the series stopped half an interval from the last data point.
You're caring here about individual events, and about spotting 100% of them; neither is suited to Prometheus. Prometheus is intended to be one tool in the toolbox, not the solution to all possible problems. You might be able to hack something together with
Yes. And I don't think the solution as sketched out will ultimately work.
That's an estimation, which could be turned into a precise calculation once we have a notion of the start and end of a series. With that notion, we can also use the
But this is clearly derailing this issue.
fortytw2 commented on Jan 13, 2017
Is there a workaround for this? I believe this is the cause of counter issues around deploys for us (rate drops to "0" during deploys) as new deploys = new pods = new timeseries that don't start at 0.
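(For illustration: a hypothetical dashboard query showing where the dip comes from; the metric name is an assumption.)

```promql
# rate() needs at least two samples per series, so series created by new pods
# contribute nothing at first, and the jump from "no series" to the first
# sample is never counted as an increase, hence the dip around deploys.
sum(rate(http_requests_total[5m]))
```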
There is no workaround, as this problem is unsolvable in general.
fortytw2 commented on Jan 13, 2017
Right, this exact problem is unsolvable in general, but I think that I can cheat my way around it: if I set up relabel configurations that drop
That depends on how fast the counter increases. If it increases slowly, there may not be an apparent reset.
fortytw2 commented on Jan 13, 2017
Great! Confirming that dropping the labels that change per-deploy seems to have "fixed" this problem for me (in case it helps someone in the future).
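(For illustration: a sketch of the kind of relabelling described above; `pod` stands in for whatever label changes per deploy and is an assumption, not the poster's actual configuration. Note the caveat about the instance label in the next comment.)

```yaml
# Drop the per-deploy label at ingestion time so counters from successive
# pods continue the same series instead of starting new ones.
metric_relabel_configs:
  - action: labeldrop
    regex: pod
```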
Dropping the instance label is a bad idea, as that's the primary label used for that purpose. It'll also default to
Given that this problem appears unsolvable and no general solutions have been suggested, I'm closing this off.
brian-brazil closed this on Feb 13, 2017
tim-d-blue commented on May 30, 2018
An unsolvable problem. This is a pretty sad result; as a monitoring solution, it holds little value if it can only estimate metrics.
Sebubu commented on Nov 22, 2018
Isn't it just possible to implement an additional function like zero_increase()? Then the user can decide which behavior they'd like to have.
pdambrauskas commented on Nov 28, 2018
I agree. We cannot have alerts configured based on increase() in this situation. Wouldn't it be possible to have some other method for that purpose, by providing some new function?
Edit: found some kind of workaround for my use case; in this case I calculate it manually.
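(For illustration: one possible shape of such a workaround, not necessarily the query that was posted here; the metric name and offset are assumptions, and counter resets are not handled.)

```promql
# Subtract the value from 5 minutes ago; for series that did not exist 5
# minutes ago the subtraction yields nothing, so fall back to the raw counter
# value via "or", i.e. treat the missing history as 0.
(errors_total - errors_total offset 5m) or errors_total
```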
shuz commented on Jan 29, 2019
We also hit this issue in one of our CounterVecs. It's hard for us to pre-populate all label combinations, and the behavior is different for a reset vs. a new time series, so it breaks when we move to Kubernetes. I do feel the suggested new function zero_increase() would be quite helpful in this situation.
discordianfish commented on May 27, 2016 (edited)
Right now, if a time series didn't exist and comes into existence with value 1, increase() returns 0, since Prometheus doesn't know whether the counter was actually increased or was simply scraped for the first time.
While this is technically correct, there are use cases where you can't set a time series to 0 before the counter gets increased for the first time. Think of any kind of low-frequency event tracking where you can't expose that an event didn't happen (i.e. set it to 0) before you have seen the event for the first time.
As an alternative to changing increase(), we could also add a function to prefix a time series with a 0 value: something that would take a scalar and a range vector and return the range vector with the scalar prefixed if it has no value.
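(For illustration: `prefix()` below is hypothetical and does not exist in PromQL; it only sketches how the proposed function might be used, with an assumed metric name and window.)

```promql
# Hypothetical: prefix(0, v[10m]) would return the range vector v[10m] with a
# 0 sample prepended when the series has no value at the start of the range,
# so increase() would count the first real sample as an increase from 0.
increase(prefix(0, errors_total[10m]))
```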