Configurable limit for Prometheus's disk space usage #968
Comments
(Oopsie, silly mobile interface, making the submit button so big and so close to the comment box.) place; or using a "minfree" parameter to tell Prometheus to keep an eye on the filesystem's free space and not let it get too close to full.
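The "minfree" idea can be sketched as a simple free-space check. Note that no such flag exists in Prometheus; the function name, threshold, and `df` invocation below are all illustrative assumptions.

```shell
# Sketch of the proposed "minfree" behaviour (hypothetical; no such flag
# exists). Decide whether an early retention sweep is needed based on how
# many bytes are still free on the data filesystem.

# needs_sweep AVAILABLE_BYTES MINFREE_BYTES -> exit 0 when below threshold
needs_sweep() {
    [ "$1" -lt "$2" ]
}

# In practice the available bytes would come from the live filesystem, e.g.
#   avail=$(df -P -k /prometheus/data | awk 'NR==2 {print $4 * 1024}')
if needs_sweep 5000000000 10000000000; then
    echo "free space below minfree: start purging old series early"
fi
```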
I could imagine extending our API deletion endpoint to delete certain time ranges of series or, more simply, samples older than a given age. This way one could manually and explicitly cut off old time series and then restart with a new retention time. Ideally this would be available through … This would still need added support from the storage layer, so @beorn7 probably has an opinion on this. He is on vacation, though.
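For reference, an admin endpoint along these lines did eventually land in Prometheus 2.x, behind the `--web.enable-admin-api` flag. The helper below just assembles such a request; the server URL, matcher, and time range are assumed example values.

```shell
# Assemble a delete_series admin-API request (Prometheus 2.x, requires
# starting the server with --web.enable-admin-api). The base URL, matcher,
# and time range here are example values.
delete_series_url() {
    # $1 = base URL, $2 = series matcher, $3 = start time, $4 = end time
    echo "$1/api/v1/admin/tsdb/delete_series?match[]=$2&start=$3&end=$4"
}

url=$(delete_series_url http://localhost:9090 'up' 2015-01-01T00:00:00Z 2015-06-30T23:59:59Z)
echo "$url"
# To execute for real, POST it and then reclaim the space:
#   curl -X POST "$url"
#   curl -X POST http://localhost:9090/api/v1/admin/tsdb/clean_tombstones
```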
brian-brazil added the feature-request label on Dec 16, 2015
Thinking about this, I'm not sure if catering for the "disk completely full" case makes a lot of sense. Experience tells us that the storage is often in a very sorry state if Prometheus runs into a disk-full scenario. You probably cannot shut down cleanly in that state, and crash recovery needs its own share of disk, so immediate deletion of time series will only help in the cases where you have just enough disk space for crash recovery but not enough to wait for normal purging.

I have thought a couple of times about a "storage tool", i.e. a standalone command-line tool that can be used to manipulate and analyze the on-disk storage in a cold state.

I think the most helpful feature in Prometheus itself would be the suggested flags for keeping a minimum of free space on the filesystem and/or limiting the maximum size of Prometheus's data on disk. This is similar in kind to #455: not trivial to implement, but very helpful for easy operations.
beorn7 changed the title from "Option for immediate deletion of obsolete series on startup" to "Configurable limit for Prometheus's disk space usage" on Jan 11, 2016
beorn7 self-assigned this on Jan 11, 2016
fabxc added the kind/enhancement label and removed the feature request label on Apr 28, 2016
This will be much easier to implement in v2.0. Since no implementation details have been discussed here yet, we can leave this issue open to track the work of implementing this in 2.0. I will, however, unassign myself from this issue. This might actually be a nice starter project for a new contributor.
beorn7 removed their assignment on Apr 4, 2017
beorn7 added the help wanted label on Apr 4, 2017
I'm wary of this feature: switching from a retention time to a disk-space limit is going to have the same fundamental operational problems, just in a different form. If the goal is an easy way to tactically reduce disk usage while on call, the v2 storage should allow for that.
brian-brazil added the priority/Pmaybe, component/local storage, and not-as-easy-as-it-looks labels and removed the help wanted label on Jul 14, 2017
brian-brazil referenced this issue on Mar 22, 2018: "limiting storage usage, instead of setting -storage.tsdb.retention?" #3999 (closed)
fanyangCS commented on Mar 22, 2018
@brian-brazil, how about implementing this feature and letting the user decide which mode to use: storage limit or retention time? I can see scenarios where each mode makes sense.
davrodpin commented on Mar 22, 2018
We've been using Prometheus in production for more than a year. Our deployment model consists of independent devices with a limited amount of storage, which is shared with other services, and our support team doesn't have easy access to them for maintenance. Having a native option to limit the maximum amount of storage Prometheus uses would help us a lot.
SaketMahajani commented on Mar 22, 2018
I agree this would be quite useful for limited-capacity deployments. Is there no way to limit both retention time and storage size, instead of either/or?
Please don't post me-toos in bugs; it causes clutter.
mtknapp commented on Jun 1, 2018
Because this issue is still open, I assume there is some interest in limiting data retention in another way. I've looked into a few ways that retention could be limited by the amount of data that exists rather than by how old the data is, as aecolley suggested.

Data could be hard-capped at a "storage.local.maxbytes" value (or possibly at a percentage of the drive?). A few alternatives would be to ensure that there are always at least N bytes available on the drive, or that at least N% of the drive is always unused. And it could always be the user's choice whether to retain by time or by storage used, by adding another flag.

The additions needed are pretty small, and I was looking for some feedback/opinions.
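The hard-cap variant described above can be modelled in a few lines: blocks are considered oldest first, and the oldest are dropped until the total fits under the cap. The cap and block sizes below are made-up numbers, not real TSDB block sizes.

```shell
# Toy model of size-based retention: blocks are listed oldest first; drop
# the oldest blocks until the total fits under the cap. All numbers are
# made-up examples.
blocks_to_drop() {
    max=$1; shift
    total=0
    for s in "$@"; do total=$((total + s)); done
    drop=0
    for s in "$@"; do
        [ "$total" -le "$max" ] && break
        total=$((total - s))   # evict the oldest remaining block
        drop=$((drop + 1))
    done
    echo "$drop"
}

blocks_to_drop 100 40 40 40   # 120 bytes total, cap 100: prints 1
```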
KISS would imply we try to limit the number of options. I would personally be happy to have …

If more than one of those options is set, it would make sense to connect them with OR, as that's the most likely user intent. @MarkTKnapp, would you be willing to create a minimal design doc and then a PR for this? It would probably make sense to bounce this discussion off the dev list and link to this issue from there, and back to the ML archives from here.
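The OR semantics suggested here can be stated compactly: data is purged when either limit is exceeded, and an unset limit never triggers a purge. The function and parameter names are placeholders, not real Prometheus flags.

```shell
# OR-combination of retention limits (placeholder names): purge when the
# data is older than max_age OR the store is larger than max_bytes.
# A limit of 0 is treated as "unset" and never triggers a purge.
should_purge() {
    age=$1; max_age=$2; size=$3; max_bytes=$4
    { [ "$max_age" -gt 0 ] && [ "$age" -gt "$max_age" ]; } ||
    { [ "$max_bytes" -gt 0 ] && [ "$size" -gt "$max_bytes" ]; }
}

should_purge 20 15 50 100 && echo "purge (too old)"
should_purge 10 15 150 100 && echo "purge (too big)"
```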
@MarkTKnapp, that sounds like a good feature to have, and it seems to have fair use cases, so definitely start a discussion on the dev mailing list. It has a wider audience, so we can get more opinions about the use cases and possible complications.
mknapphrt referenced this issue on Nov 9, 2018: "Added storage size based retention method and new metrics" #343 (merged)
This is now solved by #4230.
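For readers landing here later: the work from #4230 shipped as a size-based retention flag alongside the time-based one, starting in Prometheus 2.7. The values below are examples only.

```shell
# Size-based retention as it eventually shipped (Prometheus 2.7+);
# the values are examples.
prometheus \
  --storage.tsdb.retention.time=15d \
  --storage.tsdb.retention.size=50GB
```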
aecolley commented Aug 6, 2015 (original issue description)
Prometheus needs a startup flag or HTTP endpoint to trigger an immediate maintenance sweep to delete old files from local storage.
Background: when disk usage hits 100% (usually due to bad planning), a reasonable recovery strategy is to reduce the storage.local.retention value and restart Prometheus. There are two problems with this strategy. First, Prometheus can't make any progress if it can't write to new files. Second, local/storage.go waits for 10% of the retention duration before it begins the first maintenance sweep. This feature request is about the second problem, not the (more difficult) first.
Advising people to make free with rm under the prometheus storage dir makes me queasy, even if it's safe to remove the older series files. For operations work, we need a less-risky procedure for the sleep-deprived pager-carrier who is facing a full disk and a stuck Prometheus.
I want to be able to say: stop Prometheus; delete the orphaned directory to clear up some space; start it up with a shorter storage.local.retention; POST to /force-maintenance-sweep; then watch the disk usage drop.
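That wished-for procedure, as a command sketch. The /force-maintenance-sweep endpoint is the hypothetical one this issue requests (it does not exist), and the data path and retention value are examples only.

```shell
# Hoped-for recovery procedure; /force-maintenance-sweep is hypothetical,
# and the data path and 72h retention value are examples only.
systemctl stop prometheus
# ...delete the orphaned directory to clear up some space...
prometheus -storage.local.retention=72h &   # restart with shorter retention
curl -X POST http://localhost:9090/force-maintenance-sweep
watch df -h /prometheus                     # watch the disk usage drop
```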
Possible alternatives that I can think of: using a new storage.local.maxbytes limit to stop the problem arising in the first