Improve staleness handling #398
Comments
I'm personally just affected by the pushgateway issues: my batch job runs every 10 minutes and pushes a metric with a timestamp to the pushgateway. Prometheus scrapes it and can reach it just fine, so one would expect that it persists that data point. But since there are more than 5 minutes between the data points, it considers them stale.
If you have a good idea how to redesign the current scraping/storage integration to allow for this (efficiently), let me know!
@discordianfish Regarding your issue with the pushgateway: the expectation is that you normally wouldn't explicitly assign timestamps to pushed samples. That is only for power users. Just send the sample value, and Prometheus will attach a current timestamp to the pushed sample value on every scrape. Of course, you then need to get used to the semantics that the timestamp is from the last scrape and not the last push, but this is the expected use case.
@discordianfish Yup. The timestamp field is in most cases not what you want. In the typical use case, where you want to report something like the time of completion of a batch job via the pushgateway, you would create a metric last_completion_time_seconds and put the Unix time into it as the value. The timestamp in the exchange format really means "scrape time", and you really need to know what you are doing if you want to manipulate it.
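[Editorial note: the pattern described above can be sketched with the Python client library. This is an illustration, not code from the thread; the gateway address and job name are placeholders.]

```python
from prometheus_client import CollectorRegistry, Gauge, push_to_gateway

registry = CollectorRegistry()

# Record the completion time as a metric *value*, not as an exposition timestamp.
last_success = Gauge(
    'last_completion_time_seconds',
    'Unix time at which the batch job last completed successfully',
    registry=registry,
)
last_success.set_to_current_time()

# 'localhost:9091' and 'my_batch_job' are placeholders for your setup.
push_to_gateway('localhost:9091', job='my_batch_job', registry=registry)
```

Freshness can then be checked with an expression like `time() - last_completion_time_seconds` instead of manipulating scrape timestamps.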
Well, you can't tell whether a metric is up to date and simply '42' each time, or whether the thing pushing to the pushgateway stopped working and the '42' is just the latest result. I thought I could just add timestamps for that, raise the stalenessDelta, and alert if the metric is gone. But something like last_completion_time_seconds will do the job as well.
brian-brazil added the enhancement label on Jan 6, 2015
brian-brazil referenced this issue on Aug 9, 2015 (Merged): Don't warn about equal timestamps during append. #973
brian-brazil referenced this issue on Aug 24, 2015 (Merged): promql: Remove interpolation of vector values. #1006
brian-brazil referenced this issue on Oct 21, 2015 (Closed): promql: Remove extrapolation from rate/increase/delta. #1161
I was just thinking about this, coming from the rate discussion, and have a sketch of a solution. The two basic cases we want to solve are:
In addition, we want something that'll produce the same results now as back in time, and across Prometheus restarts.

My idea is to add two new values that a sample can have: "fresh" and "stale". These would be persisted like normal samples, but not directly exposed to users.

When we get a scrape with timestamps set on a sample we don't have, we'd add a sample for the exported value, and a second one with the "fresh" value and the scrape time. If we get a sample we already have, then we add a new sample with the "fresh" value and the scrape time. When querying for a given time with an instant selector, if we get a "fresh" we walk back until we get the actual sample. For a range selector we'd ignore that sample.

When a scrape fails, an evaluation no longer produces certain timeseries, or a target is removed, we'd add a "stale" on all affected time-series. For an instant selector, if the first sample we hit is stale, we stop and return nothing. For a range selector we'd ignore the stale samples and return the other samples as usual (irate needs special handling here as it's more instant in semantics - could we change it to an instant selector? - and we'll need to be careful with SD blips).

There are various corner cases and details to be worked out, but this seems practical and has the right semantics.
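[Editorial note: a rough, hypothetical Python model of the proposed instant/range semantics with "fresh" and "stale" markers. The sentinel values and helper names are invented for illustration; this is not how Prometheus ultimately implements it.]

```python
import bisect

FRESH = object()  # hypothetical marker: value unchanged, re-seen at this scrape
STALE = object()  # hypothetical marker: series no longer exposed / target gone

def instant_query(samples, t):
    """samples: list of (timestamp, value) sorted by timestamp.

    Walk back from t. A FRESH marker means "keep walking back to the real
    value"; a STALE marker means the series ended, so return nothing.
    """
    i = bisect.bisect_right([ts for ts, _ in samples], t) - 1
    while i >= 0:
        ts, v = samples[i]
        if v is STALE:
            return None          # series explicitly marked stale
        if v is not FRESH:
            return (ts, v)       # the actual sample value
        i -= 1                   # FRESH: walk back to the real sample
    return None

def range_query(samples, start, end):
    """Range selectors ignore the markers and return real samples only."""
    return [(ts, v) for ts, v in samples
            if start <= ts <= end and v is not FRESH and v is not STALE]
```

For example, with samples [(10, 42.0), (20, FRESH), (30, STALE)], an instant query at t=25 walks back and returns 42.0, while at t=35 it hits the stale marker and returns nothing.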
I believe the actual challenge in staleness handling is to identify which time-series are "affected" in the above sense. To do that, we need to track which target exported which time-series in its respective last scrape. All the updates and lookups required are not trivial to get right and fast, especially in a highly concurrent, high-throughput scenario.
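[Editorial note: one way to picture the bookkeeping being described, purely as an illustration; the real implementation additionally has to handle concurrency and memory pressure. Keep, per target, the set of series seen in the previous scrape and diff it against the current one.]

```python
# Hypothetical per-target bookkeeping; the identifiers stand in for
# label-set fingerprints.
last_seen = {}  # target -> set of series identifiers seen in the last scrape

def on_scrape(target, series_in_scrape, write_stale_marker):
    """After ingesting a scrape, mark every series the target exposed last
    time but not this time as stale."""
    previous = last_seen.get(target, set())
    current = set(series_in_scrape)
    for series in previous - current:
        write_stale_marker(series)  # append a "stale" sample for that series
    last_seen[target] = current
```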
brian-brazil referenced this issue on Dec 28, 2015 (Closed): promql: Limit extrapolation of delta/rate/increase #1245
brian-brazil referenced this issue on Feb 11, 2016 (Closed): Add optional timestamps to exposition data #22
beorn7 referenced this issue on Feb 13, 2016 (Open): Retention time configurable per series (metric, rule, ...). #1381
This was referenced on Apr 15, 2016
fabxc added the kind/enhancement label and removed the feature request label on Apr 28, 2016
beorn7 referenced this issue on May 27, 2016 (Closed): increase() should consider creation of new timeseries as reset #1673
brian-brazil referenced this issue on Jul 5, 2016 (Closed): Added support for pushing with a timestamp #126
brian-brazil referenced this issue on Jul 13, 2016 (Closed): It takes Prometheus 5m to notice a metric is not available #1810
This was referenced on Aug 11, 2016
beorn7 referenced this issue on Aug 15, 2016 (Open): Make rule evaluation and federation consistent in time #1893
That has similar issues to the difference in values, as the client needs to track scrapes.
ghost commented on Oct 25, 2016
Where is the "staleness configuration" for a timeseries? You mention that it's 5m by default and configurable. I have a system that, currently, can only be scraped every 4 hours, and I want my graphs to have valid data points for up to that time without gaps.
@szxnyc The flag is `--query.staleness-delta`.
frodenas referenced this issue on Dec 2, 2016 (Closed): Metrics showing for a non-existent CF component after scale down #17
brian-brazil referenced this issue on Dec 28, 2016 (Closed): Need for timestamp, because some sources may be with a latency #2308
https://docs.google.com/document/u/1/d/1ordTPfUSaGvaBolUGeLChATwgc1xUhUxfUhudYoOeMw/edit is a proposal for how to handle this.
brian-brazil added this to the v2.x milestone on Apr 10, 2017
juliusv referenced this issue on Apr 14, 2017 (Closed): Should containerd provide image filesystem metrics? #678
The base form of this is in place for scrapes in the dev-2.0 branch if ye'd like to try it out.
@brian-brazil Can you build a new alpha release in order to test it?
I believe there are other changes pending before the next alpha. This is only half of staleness handling; recording rules need the same logic added too.
@AndreaGiardini I synced with @brian-brazil again, and we concluded that the amount of fixes and new experimental features justifies another alpha.
WIZARD-CXY commented on May 19, 2017
Looking forward to the new release.
Remaining core staleness work is now all out for review.
This is all implemented in 2.0, with the pushgateway changes just awaiting a release. We still need docs, but that'll happen with the rest of the 2.0 docs.
brian-brazil closed this on Jun 6, 2017
simonpasquier pushed a commit to simonpasquier/prometheus that referenced this issue on Oct 12, 2017
alok87 commented on Dec 20, 2017
@brian-brazil I used prometheus/cloudwatch_exporter with the below config to collect data for
Data did not get updated even after more than 10 minutes (approximately 14 minutes).
alok87 commented on Dec 20, 2017
The above was because CloudWatch gets updated with the latest data every 5 minutes. I also changed delay_seconds to 10 seconds in cloudwatch_exporter. Now the data gets refreshed within 5 minutes at most.
shollingsworth referenced this issue on Apr 13, 2018 (Merged): Treat custom textfile metric timestamps as errors #769
KodyKantor referenced this issue on Dec 14, 2018 (Open): Stale metrics reported after shard changes state #22
lock bot commented on Mar 23, 2019
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.


brian-brazil commented on Jul 15, 2014
Currently Prometheus considers a timeseries stale when it has no data points more recent than 5m (configurable).
This causes dead instances' data to hang around for longer than it should, makes less frequent scrape intervals difficult, makes advanced self-referential use cases more difficult, and prevents the timestamp from propagating from the pushgateway.
I propose that we instead consider a timeseries stale if it wasn't present in the most recent scrape of the target we get it from.
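[Editorial note: a minimal sketch of the contrast between the two rules; the function names and the 300-second default are illustrative only.]

```python
def is_stale_time_based(last_sample_ts, now, staleness_delta=300):
    """Current rule: a series is stale once its newest sample is older
    than the staleness delta (5 minutes by default)."""
    return now - last_sample_ts > staleness_delta

def is_stale_presence_based(series_id, series_in_last_scrape):
    """Proposed rule: a series is stale as soon as it is absent from the
    most recent scrape of its target, regardless of elapsed time."""
    return series_id not in series_in_last_scrape
```

Under the presence-based rule, a target scraped every 4 hours keeps its series non-stale between scrapes, while a series that disappears from a target's output is treated as stale at the very next scrape.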