
Configurable limit for Prometheus's disk space usage #968

Closed
aecolley opened this Issue Aug 6, 2015 · 13 comments


aecolley commented Aug 6, 2015

Prometheus needs a startup flag or HTTP endpoint to trigger an immediate maintenance sweep to delete old files from local storage.

Background: when disk usage hits 100% (usually due to bad planning), a reasonable recovery strategy is to reduce the storage.local.retention value and restart Prometheus. There are two problems with this strategy. First, Prometheus can't make any progress if it can't write to new files. Second, local/storage.go waits for 10% of the retention duration before it begins the first maintenance sweep. This feature request is about the second problem, not the (more difficult) first.

Advising people to make free with rm under the prometheus storage dir makes me queasy, even if it's safe to remove the older series files. For operations work, we need a less-risky procedure for the sleep-deprived pager-carrier who is facing a full disk and a stuck Prometheus.

I want to say: stop Prometheus; delete the orphaned directory to clear up some space; then start it up with a shorter storage.local.retention; and POST to /force-maintenance-sweep; then watch the disk usage drop.
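
For illustration, a minimal sketch of what such an endpoint could look like; the /force-maintenance-sweep path and the triggerMaintenance hook are hypothetical names for this proposal, not anything that exists in Prometheus:

```go
package main

import (
	"fmt"
	"net/http"
)

// triggerMaintenance is a hypothetical hook that would ask local storage to
// run its maintenance/purge pass immediately instead of waiting for the
// usual delay of 10% of the retention period.
func triggerMaintenance() error {
	// A real implementation would call into the storage layer here.
	return nil
}

func main() {
	http.HandleFunc("/force-maintenance-sweep", func(w http.ResponseWriter, r *http.Request) {
		if r.Method != "POST" {
			http.Error(w, "POST required", http.StatusMethodNotAllowed)
			return
		}
		if err := triggerMaintenance(); err != nil {
			http.Error(w, err.Error(), http.StatusInternalServerError)
			return
		}
		fmt.Fprintln(w, "maintenance sweep triggered")
	})
	http.ListenAndServe(":9090", nil)
}
```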

Possible alternatives that I can think of: using a new storage.local.maxbytes limit to stop the problem arising in the first


aecolley commented Aug 6, 2015

(oopsie, silly mobile interface, making the submit button so big and so close to the comment box)

place; or using a "minfree" parameter to tell Prometheus to keep an eye on the filesystem free space and not to get too close to full.


fabxc commented Aug 7, 2015

I could imagine extending our API deletion endpoint to delete certain time ranges of series or, more simply, samples older than a given time. That way one could manually and explicitly cut off old time series and then restart with a new retention time. Ideally this would be available through promtool, too.
Certainly cleaner than adjusting the retention time and then waiting for an implementation detail to kick in.

This would still need additional support from the storage layer, so @beorn7 probably has an opinion on this. He is on vacation, though.
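
For reference, the series-deletion API that Prometheus 2.x later shipped is close to this idea. A rough sketch of calling it from Go, assuming a server started with --web.enable-admin-api and the documented match[] and end parameters:

```go
package main

import (
	"fmt"
	"net/http"
	"net/url"
)

func main() {
	// Drop all samples matching a selector before a cutoff time via the
	// TSDB admin API (Prometheus 2.x, admin API enabled).
	params := url.Values{
		"match[]": {`{job="node"}`},
		"end":     {"2015-08-01T00:00:00Z"}, // delete everything before this time
	}
	resp, err := http.Post(
		"http://localhost:9090/api/v1/admin/tsdb/delete_series?"+params.Encode(),
		"application/json", nil)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	fmt.Println("status:", resp.Status)
}
```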


beorn7 commented Jan 11, 2016

Thinking about this, I'm not sure if catering for the "disk completely full" case makes a lot of sense. Experience tells us that the storage is often in a very sorry state if Prometheus runs into a disk-full scenario. You probably cannot shut down cleanly in that state, and the crash recovery will need its own share of disk, so an immediate deletion of time series will only help in the cases where you have just enough disk space for crash recovery but not enough to wait for normal purging...

I have thought a couple of times about a "storage tool", i.e. a standalone command-line tool that can be used to manipulate and analyze the on-disk storage in a cold state.

I think the most helpful feature in Prometheus itself would be the suggested flags for keeping a minimum of free space on the filesystem and/or limiting the maximum size of Prometheus's data on disk. This is kind of similar to #455 – not trivial to implement, but very helpful for easy operations.
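
A rough sketch of what the minimum-free-space check could look like on Linux; storagePath and minFreeBytes are illustrative values, not existing Prometheus flags:

```go
package main

import (
	"fmt"
	"syscall"
)

// freeBytes returns the number of bytes available to unprivileged users
// on the filesystem containing path.
func freeBytes(path string) (uint64, error) {
	var st syscall.Statfs_t
	if err := syscall.Statfs(path, &st); err != nil {
		return 0, err
	}
	return uint64(st.Bavail) * uint64(st.Bsize), nil
}

func main() {
	const storagePath = "/prometheus/data"        // hypothetical storage directory
	const minFreeBytes = 10 * 1024 * 1024 * 1024  // hypothetical 10 GiB floor

	free, err := freeBytes(storagePath)
	if err != nil {
		panic(err)
	}
	if free < minFreeBytes {
		fmt.Printf("only %d bytes free, would purge oldest data now\n", free)
	} else {
		fmt.Printf("%d bytes free, nothing to do\n", free)
	}
}
```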

@beorn7 beorn7 changed the title Option for immediate deletion of obsolete series on startup Configurable limit for Prometheus's disk space usage Jan 11, 2016

@beorn7 beorn7 self-assigned this Jan 11, 2016


beorn7 commented Apr 4, 2017

This will be much easier to implement in v2.0. Since no implementation details have been discussed here yet, we can leave this issue open to track the work for implementing this in 2.0. I will however unassign this issue from myself. This might actually be a nice starter project for a new contributor.

@beorn7 beorn7 removed their assignment Apr 4, 2017

@beorn7 beorn7 added the help wanted label Apr 4, 2017


brian-brazil commented Apr 4, 2017

I'm wary of this feature; switching from a retention time to a disk space limit is going to have the same fundamental operational problems, just in a different form.

If the goal is an easy way to tactically reduce disk usage while on call, v2 storage should allow for that.


fanyangCS commented Mar 22, 2018

@brian-brazil, how about implementing this feature and letting the user decide which mode to use: storage limit or retention time? I can see both modes making sense in certain scenarios.


davrodpin commented Mar 22, 2018

> @brian-brazil, how about implementing this feature and letting the user decide which mode to use: storage limit or retention time? I can see both modes making sense in certain scenarios.

We've been using Prometheus for more than a year in production. Our deployment model consists of independent devices with a limited amount of storage, which is shared with other services as well, and our support team doesn't have easy access to them for maintenance.

Having a native option to limit the maximum amount of storage Prometheus uses would help us a lot.


SaketMahajani commented Mar 22, 2018

I agree this would be quite useful for limited-capacity deployments. Is there no way to limit by both retention time and storage used, instead of either/or?


brian-brazil commented Mar 22, 2018

Please don't post me-too comments in issues; it causes clutter.


mtknapp commented Jun 1, 2018

Because this issue is still open, I assume there is some interest in limiting data retention in another way. I've looked into a few ways that retention could be limited by the amount of data that exists rather than by how old the data is, as aecolley suggested. Data could be hard-capped at a "storage.local.maxbytes" (or possibly limited to a percentage of the drive?), but a few alternatives would be to make sure that there are always at least N bytes available on the drive, or that at least N% of the drive is always unused.

And of course it could always be an option for the user, by adding another flag, to decide whether they want to retain by time or by storage used. The additions needed are pretty small, and I was looking for some feedback/opinions.
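
A sketch of the size-cap variant, summing the storage directory's on-disk size and comparing it against a hard cap; dirSize, storagePath, and maxBytes are illustrative names rather than existing flags:

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// dirSize walks dir and sums the size of all regular files in it.
func dirSize(dir string) (int64, error) {
	var total int64
	err := filepath.Walk(dir, func(_ string, info os.FileInfo, err error) error {
		if err != nil {
			return err
		}
		if info.Mode().IsRegular() {
			total += info.Size()
		}
		return nil
	})
	return total, err
}

func main() {
	const storagePath = "/prometheus/data"          // hypothetical storage directory
	const maxBytes = int64(50) * 1024 * 1024 * 1024 // hypothetical 50 GiB cap

	used, err := dirSize(storagePath)
	if err != nil {
		panic(err)
	}
	if used > maxBytes {
		fmt.Printf("storage uses %d bytes, exceeding the %d byte cap; would drop oldest data\n", used, maxBytes)
	}
}
```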


RichiH commented Jun 1, 2018

KISS would imply that we try to limit the number of options. I would personally be happy to have:

  • data retention time
  • max storage size
  • optionally max storage percentage, though this introduces operational interdependencies

If more than one of those options is set, it would make sense to connect them with OR, as that's the most likely user intent.
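
A tiny sketch of that OR semantics; retention, maxBytes, and the purge decision are illustrative, not existing Prometheus behaviour:

```go
package main

import (
	"fmt"
	"time"
)

// shouldPurgeOldest combines the proposed limits with OR: the oldest data is
// dropped as soon as either the retention time or the size cap is exceeded.
// A zero value means "limit not set".
func shouldPurgeOldest(oldestBlockAge, retention time.Duration, storageSize, maxBytes int64) bool {
	byTime := retention > 0 && oldestBlockAge > retention
	bySize := maxBytes > 0 && storageSize > maxBytes
	return byTime || bySize
}

func main() {
	fmt.Println(shouldPurgeOldest(40*24*time.Hour, 30*24*time.Hour, 0, 0)) // true: over retention
	fmt.Println(shouldPurgeOldest(0, 0, 60<<30, 50<<30))                   // true: over size cap
}
```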

@MarkTKnapp would you be willing to create a minimal design doc and then a PR for this? It would probably make sense to bounce this discussion off the dev list and link to this issue from there, and back to the ML archives from here.


krasi-georgiev commented Jun 1, 2018

@MarkTKnapp that sounds like a good feature to have, and it seems to have fair use cases, so definitely start a discussion on the dev mailing list. It has a wider audience, so we can get more opinions about the use cases and possible complications.


SuperQ commented Jan 25, 2019

This is now solved by #4230

@SuperQ SuperQ closed this Jan 25, 2019
