Retention time configurable per series (metric, rule, ...). #1381
Comments
|
This is not something Prometheus supports directly at the moment, nor for the foreseeable future. The focus right now is on operational monitoring, i.e. the "here and now". You can get something like this with a tiered setup: a first-level Prometheus scrapes all the targets and evaluates the rules, and a second-level Prometheus federates from it, fetching only the results of those rules. It can do so at a lower resolution, but keep in mind that if you set the federation scrape interval above 5 minutes, the series will be treated as stale. Additionally, the second-level Prometheus could use the (experimental) remote storage facilities to push these time series to OpenTSDB or InfluxDB as they are federated in. To query those you will need to use their own query mechanisms; there is no read-back support at the moment. |
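For illustration, the second-level Prometheus in such a tiered setup would federate the rule results with a scrape configuration along these lines (a sketch only; the job name, match[] selector, interval, and target address are placeholders, not anything from this thread):

```yaml
scrape_configs:
  - job_name: 'federate-aggregates'
    honor_labels: true
    metrics_path: '/federate'
    # Pull only recording-rule results (here assumed to follow the "job:..."
    # naming convention), not the raw per-target series.
    params:
      'match[]':
        - '{__name__=~"job:.*"}'
    scrape_interval: 1m
    static_configs:
      - targets:
        - 'first-level-prometheus:9090'
```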
|
The "5min-problem" is handled by #398. The planned grouping of rules will allow individual evaluation intervals for groups. So something like a "1 hour aggregate" can be configured in a meaningful way. The piece missing is retention time per series, which I will rename this bug into and make it a feature request. We discussed it several times. It's not a high priority right now, but certainly something we would consider. |
|
A per-job retention period is what I need for my use case. I pull 4 metrics from my solar panels every 30 seconds and want to store them forever (so I can, for example, go 6 months back and see the production at that moment), but I don't need that for all the other metrics (like Prometheus's own metrics). |
|
Prometheus is not intended for indefinite storage; you want #10.
I see that #10 makes sense if you have a lot of time series, but OpenTSDB seems like overkill just to store 4 time series forever. Isn't it just a question of allowing people to set the retention period to forever? Or do you think people will "abuse" that? |
|
We make design decisions that presume that Prometheus data is ephemeral, and can be lost/blown away with no impact. |
|
Coming here from the Google Groups discussion about the same topic. |
|
I plan to tackle this today. Essentially it would mean regularly calling the delete API and cleaning up the tombstones in the background. The question is where this should live. My inclination is that we could leverage the delete API itself, add a tombstone cleanup API, and add functionality to promtool to call these APIs regularly with the right matchers. Otherwise I would need to manipulate the blocks on disk with a separate tool, which, I must say, I'm not inclined to do. |
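For reference, this is roughly what those two operations look like against the TSDB admin API in Prometheus 2.x (a sketch; it assumes the server was started with --web.enable-admin-api, and the series matcher and cutoff timestamp are placeholders):

```bash
# Mark all samples of the matched series older than the cutoff as deleted (tombstoned).
curl -X POST -g \
  'http://localhost:9090/api/v1/admin/tsdb/delete_series?match[]=node_cpu_seconds_total&end=2017-11-01T00:00:00Z'

# Remove the tombstoned data from disk and free the space.
curl -X POST 'http://localhost:9090/api/v1/admin/tsdb/clean_tombstones'
```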
|
One alternative is to make it part of the tsdb tool and "mount" the tsdb tool under "promtool tsdb", which has other nice benefits. That would make the functionality usable outside of the Prometheus context. Prometheus users would need to run 2 extra commands for disable/enable compaction, or just wrap those around it when calling via promtool.
|
|
My concern there is the edge cases: what if the request to restart compaction fails? While the tsdb tool makes perfect sense on static data, I think it would be cleaner if we could make it an API on top of the running server. Having it as an API also allows us to make it a feature of Prometheus if people care and Brian agrees ;) |
|
I wouldn't object to delete and force-cleanup functionality being added to promtool. I have a general concern that users looking for this tend to be over-optimising and misunderstanding how Prometheus is intended to be used, as in the original post of this issue. I'd also have performance concerns with all this cleanup going on. |
|
I don't think anything can be done on the tsdb side for this, so I removed the corresponding label. There doesn't seem to be big demand for such a use case, and since the issue is so old, maybe we should close it and revisit if it comes up again or if @taviLaies is still interested in this. |
|
A few of us had discussions around this at KubeCon and find dynamic retention valuable for both Prometheus and Thanos. Generally, the approach we were discussing is to include the tool within the Prometheus code as part of compaction, and allow users to define retention with matchers. A design doc will be coming soon, but I am happy to hear any major concerns around compaction-time processing sooner rather than later so I can include them. |
|
Compaction is currently an entirely internal process that's not exposed to users, and in particular does not affect query semantics. I'd prefer to ensure that we expose the delete API via promtool, and let users work from there themselves.
|
|
I think in this case it's justified to expose this to users – "I want to keep some metrics longer than others" is such a common use case that I don't think we should relegate it to "write your own shell scripts". The impact on query semantics doesn't have to be explicitly bound to compaction – it can simply be "samples will disappear within X hours after they have reached their retention period". |
|
There are a few unrelated things being tied together there. One thing we do know is that users tend to be over-aggressive in their settings, which then causes them significant performance impact. This is why we don't currently have a feature in this area; the last person to investigate it found it not to work out in practice.
It'd be a single curl/promtool invocation, so it's not something that even really qualifies as a shell script. |
|
It would still need to be executed regularly to fulfill the need, so it needs to be scheduled, monitored, and updated. When would the corresponding space be freed?
|
Cron covers that largely, plus existing disk space alerting.
Typically the space would be freed automatically within 2 hours, unless they trigger it manually (which is where performance problems tend to come in; this gets triggered far too often). |
|
How close can one get to an ideal scenario where a user is not made to worry about what to retain for how long, but instead the system adapts to a storage quota? It could track actual query usage of metrics and their time windows, so it can predict metrics / times that are likely to be unneeded, and prefer them for disposal. |
|
Sorry, I was unclear. The storage documentation already says that blocks will not get cleaned up for up to two hours after they have exceeded the retention setting. E.g. with a retention of 6 hours, I can still query data from 8-10 hours ago. |
|
That's only at the bounds of full retention, and IMHO we should keep the 1.x behaviour of having a consistent time for that. It's not the last few hours of data with typical retention times. |
|
I have started a design doc for this work here: https://docs.google.com/document/d/1Dvn7GjUtjFlBnxCD8UWI2Q0EiCcBEx_j9eodfUkE8vg/edit?usp=sharing All comments are appreciated! |
|
Is there any progress on this issue? I have a similar problem. I want to monitor the total error count on network switches, but some of them don't expose an SNMP OID for total errors. So I have to collect the different error types (CRC, alignment, jabber, etc.) and calculate their sum, but I only want to keep the total errors, not the others. |
|
No progress to report; there are still many unresolved comments in the design doc I put forward, and I have not had the time or energy required to get consensus. There is some work related to this in Thanos that has been proposed for Google Summer of Code (thanos-io/thanos#903). If you only need to delete certain well-known series, calling the delete series API on a regular schedule is an option (see the sketch below). |
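A minimal version of that scheduled job could look like this (a sketch, assuming GNU date and a Prometheus started with --web.enable-admin-api; the job matcher, retention window, and script name are made up for illustration):

```bash
#!/bin/sh
# prune-expensive-series.sh: delete samples of one expensive job older than 30 days,
# then reclaim the disk space. Run it e.g. nightly from cron:
#   0 2 * * * /usr/local/bin/prune-expensive-series.sh
PROM="http://localhost:9090"
CUTOFF="$(date -u -d '30 days ago' +%Y-%m-%dT%H:%M:%SZ)"
curl -s -X POST -g "$PROM/api/v1/admin/tsdb/delete_series?match[]={job=\"expensive-job\"}&end=$CUTOFF"
curl -s -X POST "$PROM/api/v1/admin/tsdb/clean_tombstones"
```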
FWIW I can live with that. Ceph already behaves sort of this way when deleting RBD volumes. In my situation, there are metrics that aren't likely to be useful past, say, 30 days, like network stats. Others could have value going back for months, e.g. certain Ceph capacity, performance, etc. metrics. Ideally I'd love to be able to downsample older metrics; maybe I only need to keep one sample per day. Use case: feeding Grafana dashboards and ad-hoc queries. The federation idea is clever and effective, but would complicate the heck out of queries. I would need to duplicate dashboards across the 2 (or more) data sources, which doesn't make for the best user experience and is prone to divergence. |
|
Has this been looked into any further by the development team? Or have any users found any workarounds? This would help me a lot with my dashboards. |
|
My workaround is to deploy VictoriaMetrics next to Prometheus, then configure VictoriaMetrics to scrape Prometheus via federation, filtering which metrics to scrape, with a different retention and coarser granularity.

Command flags:

```yaml
- -retentionPeriod=120 # 120 months
- -dedup.minScrapeInterval=15m
```

promscrape.config:

```yaml
scrape_configs:
  - job_name: prometheus
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        - '{__name__="metric1"}'
        - '{__name__="metric2"}'
    static_configs:
      - targets:
        - 'prometheus:9090'
```
|
One workaround is a setup with multiple Prometheus services having different configurations (plus/or Thanos, depending on the scenario).
|
We decided to put a clean-up policy in place (via the Prometheus REST API): the default retention is 12 weeks, but part of the data is cleaned up after 6 weeks. We considered this cheaper to maintain than several instances. |
There is a topic on the Prometheus dev summit agenda to discuss this issue. I am hopeful that we will discuss it either this month or in January, but I cannot say for sure. After the discussion, we will be able to provide a more complete answer as to how we would like this in (or not in) Prometheus.
The current ways to do this are either federation to a second Prometheus instance, or having an external process call the admin delete API. |
|
This would be a really great feature for Prometheus. The use case is not only long-term metrics (which, as some people argued in the comments, is not what Prometheus is intended for). There are lots of expensive metrics which we want to keep for only a single day, but the rest of the metrics for 15 days. So now we have to operate two Prometheus instances. Apart from the operational burden, some of our queries need matching operators to filter metrics based on series that live in the other instance, which makes things harder. I know there is the Thanos option for globally querying multiple Prometheis, but it is overkill to use it only because some metrics cannot be retained for a shorter time. |
|
It would be nice to revisit this 🤗 There are big wins if we have something like this: prioritizing, aggregations, keeping data only to satisfy alerts and then discarding it, etc. cc @csmarchbanks, wondering if it's time to resurrect your proposal (: |
|
If it doesn't get discussed at the upcoming dev summit, perhaps let's get a few interested parties together to get it moving without a dev summit? It's been on the agenda with a fair number of votes for quite a while now, so I hope it gets discussed. |
|
Good news! There was consensus in today's dev summit that we would like to implement dynamic retention inside the Prometheus server. The next step is to decide how we would like to implement this feature. Right now it looks like there are two proposals in the document I linked: one for a new configuration format that allows reducing or extending retention based on a set of matchers, and a second building on rule evaluation to delete data older than a given age. Anyone who is interested, please provide feedback on either of those approaches (or a new one) so that implementation work can begin. |
|
Hi, I would like to tackle this issue as my GSoC'21 project.
Let me know if there are any other existing discussions I should read. As a first step, I'm going to do some code reading and figure out the dependencies. I have several questions, but I'm not sure what steps I should take, so I want to decide how to proceed first. |
|
@FujishigeTemma Those are great discussions to start, if you have questions about GSoC feel free to reach out to me via email or in the CNCF slack. Otherwise, part of the GSoC project will be to make sure a design is accepted and then start implementing it. |
|
As this project was not selected in GSoC this year, do we have any other updates or progress on this? |
There is no progress. |
|
Based on the design doc https://docs.google.com/document/d/1Dvn7GjUtjFlBnxCD8UWI2Q0EiCcBEx_j9eodfUkE8vg/edit#, for a config like:

```yaml
retention_configs:
  - retention: 1w
    matchers:
      - {job="node"}
  - retention: 60d
    matchers:
      - slo_errors_total{job="my-service"}
      - slo_requests_total{job="my-service"}
  - retention: 2w
    matchers:
      - {job="my-service"}
```

an approach would be to drop out-of-retention series at compaction time. To achieve that, we can extend the compactor interface with modifiers like #9413: we can define a retention-time- and matcher-aware modifier that only keeps the chunk series we want, or simply use a ChunkQuerier to get the ChunkSeriesSet for the given matchers. An implementation of a modifier that goes through each series: https://github.com/yeya24/prometheus/blob/experiment-modify/tsdb/modifiers.go#L202-L277 |
|
@yeya24 Is the implementation going to be merged? What is the status of the design doc? |
Hello,
I'm evaluating Prometheus as our telemetry platform and I'm looking to see if there's a way to set up Graphite-like retention.
Let's assume I have a retention period of 15d in Prometheus and I define aggregation rules that collapse the samples into 1h aggregates. Is there a way to keep this new metric around for more than 15 days?
If this is not possible, could you provide some insight into how you approach historical data in your systems?
Thank you
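(For reference, the aggregation half of this is normally done with a recording rule group like the sketch below; the metric names are made up, and the rule by itself does not change how long anything is retained.)

```yaml
groups:
  - name: hourly_aggregates
    interval: 1h          # evaluate once per hour
    rules:
      # Hypothetical example: per-job request rate averaged over the last hour.
      - record: job:http_requests:rate1h
        expr: sum by (job) (rate(http_requests_total[1h]))
```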