
increase() should consider creation of new timeseries as reset #1673

Closed
discordianfish opened this Issue May 27, 2016 · 24 comments

@discordianfish (Member) commented May 27, 2016

Right now, if a time series didn't exist and comes into existence with value 1, increase() returns 0, since Prometheus doesn't know whether the counter was actually increased or simply scraped for the first time.
While this is technically correct, there are use cases where you can't set a time series to 0 before the counter gets increased for the first time. Think of any kind of low-frequency event tracking where you can't expose that an event didn't happen (set to 0) before you saw the event for the first time.

As an alternative to changing increase(), we could also add a function to prefix a time series with a 0 value: something that takes a scalar and a range vector and returns the range vector prefixed with the scalar where it has no value.
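
A minimal PromQL sketch of the behavior being described, using a hypothetical some_error_counter series (the same name used further down in this thread):

# The series does not exist at all, then appears with value 1 once the
# first event is counted:
#
#   t0: (no sample)
#   t1: some_error_counter{job="app"} = 1
#   t2: some_error_counter{job="app"} = 1
#
# A 5m range that only covers t1 contains a single sample, so increase()
# returns nothing; once t2 is inside the range it returns 0, because the
# first sample is taken as the starting point rather than as an increase of 1.
increase(some_error_counter[5m])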

@brian-brazil (Member) commented May 27, 2016

I don't think this is workable. We can't distinguish between, for example, a freshly restarted Prometheus, a gap in the data relative to the range you're looking at, and a target that just came up and immediately incremented a counter.

@beorn7 (Member) commented May 27, 2016

The use case is clear, and we discussed this before.

I still cannot see how we could arrive at a different outcome. It is not only about "being scraped for the first time". It might also be a cut-off because of retention, or a missed scrape, or whatnot.

How would that proposed prefix function work? When exactly would it prefix the time series? Based on the gap towards the beginning of the range?

I think to solve the problem, we need a notion of "the time series begins here". That coincides with the solution to #398.

@beorn7 (Member) commented May 27, 2016

Brian won the race by a few seconds... ;)

@brian-brazil (Member) commented May 27, 2016

I'm not sure #398 helps us here. That's about when time series stop rather than when they start.

@beorn7 (Member) commented May 27, 2016

I see the stop and start of a series as two sides of the same coin. I expect any sane solution for one to be a solution for the other, too, which would then solve a whole class of related problems (like the rate extrapolation discussed elsewhere, or the detection of gaps for the purpose of backfilling, and, incidentally, this use case as well).

@brian-brazil (Member) commented May 27, 2016

It still wouldn't let us tell when an increase occurred; it could have been a month ago or a second ago.

Staleness is intended to only affect how we deal with instant vectors; range vector semantics would remain unchanged.

@discordianfish (Member Author) commented May 27, 2016

What worries me is that there isn't any reasonable workaround for the problem. If you have an exporter that exports counter metrics from a 3rd-party system and has no notion of which metrics to expect (so that it could expose them with value 0 before a counter increase happens), there is no way to catch the first counter increase.

> How would that proposed prefix function work? When exactly would it prefix the time series? Based on the gap towards the beginning of the range?

I see how this is far from ideal, but something like that would solve our problem. Still, marking the start/stop of series seems like the better solution: if the elements in a range vector indicated whether a sample is the first value of the series, increase() (or a new op) could assume the value before it was 0.

But you know better how to implement that. I'm just saying that we have use cases for this, and without it being solved we either need to build a one-off from scratch to do such alerting or come up with ugly hacks.

Unless of course you know a better workaround, or a simpler feature that could be used as a workaround. In the end it's about alerting on increase(some_error_counter[5m]) > 0, which should trigger even if some_error_counter with the matching label pairs came into existence with value 1. I'd argue this is a general problem wherever you have a system emitting metrics "once something happens" (as opposed to setting counters initially and then increasing them).

@brian-brazil (Member) commented May 27, 2016

Why not do some_error_counter > 0?

@beorn7 (Member) commented May 27, 2016

> It still wouldn't let us tell when an increase occurred; it could have been a month ago or a second ago.

If you had a notion of the "true" start of the series (in contrast to "I just started to scrape this target", "I just started the Prometheus server for the first time", or "the network just came back"), you could answer the OP's question: did this error count of 1 just spring into existence, or has the error counter been at 1 for quite some time and I just didn't notice for other reasons?

> Staleness is intended to only affect how we deal with instant vectors; range vector semantics would remain unchanged.

That might have been the original intention, but we have already realized it is a hard problem to solve. The longer I think about it, the more I realize that its solution is also the solution to a whole class of problems. I expect that you cannot solve the problem of staleness in instant queries without also solving all those other problems.

@discordianfish (Member Author) commented May 27, 2016

@brian-brazil Because nobody will reset the error counter; I need the alert to recover on its own. And yes, I know I could also change the world and make whatever is erroring expose 0 when there is no error, and that would be the better setup from a Prometheus perspective, but it's simply not possible currently. That might change in the future, but there will always be similar use cases, which is a problem if you want to rely on Prometheus as the sole monitoring system.

@brian-brazil (Member) commented May 27, 2016

> I expect that you cannot solve the problem of staleness in instant queries without also solving all those other problems.

The solution and semantics I've sketched out only cover instant selectors. They give us no additional data that's useful for ranges that I can see, as the actual data points are all outside the range. Notification of staleness doesn't help much, as we're already estimating that the series stopped half an interval from the last data point.

@brian-brazil (Member) commented May 27, 2016

> there will always be similar use cases, which is a problem if you want to rely on Prometheus as the sole monitoring system.

You're caring here about individual events, and about spotting 100% of them; neither is suited to Prometheus. Prometheus is intended to be one tool in the toolbox, not the solution to all possible problems.

You might be able to hack something together with offset and unless, but this doesn't fit the general Prometheus philosophy of counters.
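
For illustration only, one shape such an offset/unless hack might take (a sketch using the hypothetical some_error_counter from above, not an endorsed pattern):

# Fire on a measured increase, or on a series that exists now but had no
# sample 5 minutes ago, i.e. a counter that (re)appeared within the window:
  increase(some_error_counter[5m]) > 0
or
  (some_error_counter unless some_error_counter offset 5m) > 0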

@beorn7 (Member) commented May 27, 2016

> The solution and semantics I've sketched out only cover instant selectors.

Yes. And I don't think the solution as sketched out will ultimately work.

> Notification of staleness doesn't help much, as we're already estimating that the series stopped half an interval from the last data point.

That's an estimation, which could be turned into a precise calculation once we have a notion of the start and end of a series. With that notion, we could also solve the irate problem.

But this is clearly derailing this issue.

@fortytw2 commented Jan 13, 2017

Is there a workaround for this? I believe this is the cause of our counter issues around deploys (rate drops to "0" during deploys), as new deploys = new pods = new time series that don't start at 0.

@brian-brazil (Member) commented Jan 13, 2017

There is no workaround, as this problem is unsolvable in general.

@fortytw2 commented Jan 13, 2017

Right, this exact problem is unsolvable in general, but I think I can cheat my way around it: if I set up relabel configurations that drop pod_name, so that every deploy is not a new time series but just a counter reset, my strange rate() / increase() problems should subside? (Writing my way through this, just want to check that my thought process is sound.)

@brian-brazil (Member) commented Jan 13, 2017

That depends on how fast the counter increases. If the rate is low, there may not be an apparent reset.

@fortytw2 commented Jan 13, 2017

Great! Confirming that dropping the labels that change per deploy seems to have "fixed" this problem for me (in case it helps someone in the future 🙂):

+  - target_label: pod_name
+    replacement: ''
+  - target_label: instance
+    replacement: ''
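
For context, a sketch of where such fragments would sit in a scrape config (the job name and surrounding keys here are illustrative only; note the caveat about the instance label in the next comment):

scrape_configs:
  - job_name: kubernetes-pods        # hypothetical job name
    # ... service discovery etc. ...
    metric_relabel_configs:
      # Setting a target label to the empty string removes it from every
      # scraped series, so samples from successive pods continue the same
      # time series instead of starting a new one.
      - target_label: pod_name
        replacement: ''
      - target_label: instance
        replacement: ''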

@brian-brazil (Member) commented Jan 13, 2017

Dropping the instance label is a bad idea, as that's the primary label used for that purpose. It'll also default to __address__ if you do that.

@brian-brazil (Member) commented Feb 13, 2017

Given that this problem appears unsolvable and no general solutions have been suggested, I'm closing this off.

@tim-d-blue commented May 30, 2018

An unsolvable problem. This is a pretty sad result; a monitoring solution that can only estimate metrics holds little value.

@Sebubu commented Nov 22, 2018

Isn't it possible to just implement an additional function like zero_increase(), which starts at zero, unlike increase()?

Then users can decide which behavior they want.

@pdambrauskas commented Nov 28, 2018

I agree. We cannot configure alerts based on the increase() function's result, since we lose a big chunk of data.
It makes Prometheus not an option for us :(

Wouldn't it be possible to have some other function for that purpose, increase_with_default(default_value, ...) or similar, to fill in the missing gaps with a default value?

> We can't distinguish between, for example, a freshly restarted Prometheus, a gap in the data relative to the range you're looking at, and a target that just came up and immediately incremented a counter.

By providing such an increase() alternative, we would leave this responsibility to the Prometheus user.

Edit:

I found a kind of workaround for my use case:

(sum(flow_events{type="fail"} or flow_events{} * 0) by (source) - sum(flow_events{type="fail"} offset 5m or flow_events{} * 0) by (source))
/
(sum(flow_events{}) by (source) - sum(flow_events{} offset 5m) by (source))

In this case I calculate the increase without using a PromQL function and replace null values via "or flow_events{} * 0"; however, you must have some kind of events with the same labels to make this possible (in my case it is flow_events without a type specified).
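
Roughly how that query reads, with the grouping spelled out (same expression as above; flow_events and its labels are specific to that setup):

# Numerator: per-source failure count now minus 5 minutes ago. The
# "or flow_events{} * 0" part makes sure every source that exposes any
# flow_events series contributes a value (0 if it has no type="fail"
# series yet), so the difference exists even before the first failure.
(
    sum by (source) (flow_events{type="fail"} or flow_events{} * 0)
  - sum by (source) (flow_events{type="fail"} offset 5m or flow_events{} * 0)
)
/
# Denominator: total per-source change of flow_events over the same 5 minutes.
(
    sum by (source) (flow_events{})
  - sum by (source) (flow_events{} offset 5m)
)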

@shuz commented Jan 29, 2019

We also hit this issue with one of our CounterVecs. It's hard for us to pre-populate all label combinations, and the behavior is different for a reset vs. a new time series, so it breaks when we move to Kubernetes.

I do feel the suggested new zero_increase() function would be quite helpful in this situation.
