Consistent backups of a running Prometheus should be possible #651
Comments
I also reproduced the first result by explicitly waiting for a checkpoint after copying timeseries so heads.db is fresh, and the result is pretty much the same as the first time around.
I'd suggest we add a "prometheus snapshot" command to the CLI for that.
After some discussion and thinking, the approach I like most is an endpoint that returns a giant tarball with a consistent set of series files and the heads.db.

On a different note, I'm curious to reproduce the state where recovery takes forever. On SSD, I expect recovery to always be reasonably fast (albeit with possibly significant data loss). A case where it takes very long would be very interesting to investigate.
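If such an endpoint existed, the client side of a backup could be a single download. A minimal sketch; the URL path below is purely hypothetical and was never part of an actual Prometheus API:

```bash
# Hypothetical: fetch a consistent tarball (series files + heads.db)
# from a running server. The path /api/backup.tar is invented here.
curl -sf -o prometheus-backup.tar http://localhost:9090/api/backup.tar
tar tf prometheus-backup.tar | head   # quick sanity check of the archive
```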
beorn7 self-assigned this on Apr 26, 2015
beorn7 added the feature-request label on Apr 26, 2015
It would be perfect if Prometheus itself could scrape such a backup endpoint, to allow timeseries aggregation / merging of remote servers into one central one. If I understand the data model correctly, the hash system should let redundant timeseries be merged?
@wrouesnel That sounds more like federation. Cf. #9
atombender referenced this issue on Jan 8, 2016: Isolate corrupted series files instead of panicking. #877 (closed)
beorn7 referenced this issue on Mar 22, 2016: deamon doesn't start after crash - leveldb manifest corrupted #1496 (closed)
fabxc added kind/enhancement and removed feature request labels on Apr 28, 2016
davidewatson commented on Jun 9, 2016
I have a slightly different use case. Rather than back up the Prometheus data, I want to run benchmarks and then archive the Prometheus data for later analysis. I've been experimenting with stopping Prometheus (using SIGTERM), backing up the underlying storage, and then restarting it. This appears to work reliably. Does anyone see any problems with this approach (assuming that I am OK with gaps in monitoring)?
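A minimal sketch of that stop-copy-restart flow, assuming a systemd-managed Prometheus and an example data directory; adjust the unit name and paths to your setup:

```bash
# Stop Prometheus cleanly (systemd sends SIGTERM), archive the storage
# directory, then bring the server back up. Gaps in scraping are accepted.
sudo systemctl stop prometheus
sudo tar czf "prometheus-data-$(date +%F).tar.gz" -C /var/lib prometheus
sudo systemctl start prometheus
```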
That should be fine, although you need to make sure to always start Prometheus with a very high retention so that it does not start deleting old data when you restore a backup for analysis.
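A sketch of what that could look like for the analysis instance, assuming the Prometheus 1.x flag names `-storage.local.path` and `-storage.local.retention` (adapt to your version):

```bash
# Run a throwaway Prometheus against the restored storage with a very
# long retention so the old samples are not garbage-collected.
prometheus -config.file=prometheus.yml \
           -storage.local.path=/data/restored-storage \
           -storage.local.retention=87600h   # roughly ten years
```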
I think I will take a shot at this one, @beorn7. I'm kinda new to the Prometheus code base; any pointers for this particular feature?
The problem is not so much in the Prometheus codebase but in leveldb. We don't have a good understanding of how to take snapshots or anything like that. And then it has to be integrated with the (less problematic) implementation on the Prometheus side. Honestly, I don't think this is a good starter project to get familiar with the Prometheus code base. It's one of the hairier topics.

Also: @fabxc gave us a sneak preview of his new storage design yesterday, see https://twitter.com/juliusvolz/status/822121045658337280 . With that design, hot backups will be trivial to implement. You could as well wait for Prometheus 2.0, which might be using that new design. (And even if @fabxc 's design won't make it into the released Prometheus any time soon, we will need to rethink indexing, as we are running into multiple limitations of the current indexing model. So any work done on backing up the current indexing might be moot by the time it's done.)
I guess waiting for Prometheus 2.0 seems reasonable. I've stumbled upon the need to back up my Prometheus data lately, and that's why I ended up on this issue. I'll try looking for another issue to work on.
@andrestc If you're specifically looking for starter projects, the |
beorn7 added the reviewed/won't fix label on Apr 4, 2017
Consistent hot backups will be naturally possible with v2.0. This issue is mostly concerned with implementation details of previous versions, and that implementation will never happen. Thus, I'm closing this one.
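For later readers, this is roughly how it turned out: Prometheus 2.x exposes a TSDB snapshot endpoint through its admin API. A sketch (the admin API must be explicitly enabled; this is not part of the 1.x storage discussed above):

```bash
# Prometheus 2.x: take a consistent snapshot of the TSDB while the
# server is running. Requires starting Prometheus with --web.enable-admin-api.
curl -XPOST http://localhost:9090/api/v1/admin/tsdb/snapshot
# The snapshot is written under <storage dir>/snapshots/ and can then
# be copied away with plain tar or rsync.
```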
beorn7 closed this on Apr 4, 2017
Issue being tracked at: prometheus/tsdb#4
simonpasquier pushed a commit to simonpasquier/prometheus that referenced this issue on Oct 12, 2017
lock bot commented on Mar 23, 2019
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs. |
matthiasr commented on Apr 24, 2015
I experimented a bit with copying the storage out from underneath a running Prometheus.
In the zeroth run, I tried rsyncing it, but this takes forever because there are so many files. Using tar is much much faster, and since many files change anyway, rsync won't gain you much even on subsequent backups.
In the next run, I copied the whole storage with tar (in this case, directly into the right place on a new server). After that was finished, @beorn7 encouraged me to take another copy of heads.db to get a fresher snapshot. This snapshot had happened after the copying of the timeseries. This is the result: [graph omitted]. The query is a count of a subset of the timeseries. Presumably, a few chunks were persisted between copying their timeseries and copying heads.db, so that they were contained in neither on the backup.

In a third run, trying to be clever, I used tar c storage storage/heads.db so that there is a new copy of heads.db at the end of the tar stream. However, this snapshot is then older than many of the timeseries, and recovery had not finished after >16h, so I aborted it.

In the fourth run, I did the same but also touch(1)ed the heads.db to short-circuit the recovery. The result was not so good: [graph omitted].
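A rough shell rendering of the runs described above (the storage path is an example for a Prometheus 1.x local storage directory):

```bash
STORAGE=/var/lib/prometheus   # example path to the 1.x local storage

# Zeroth run: rsync, which crawls because of the sheer number of series files.
rsync -a "$STORAGE/" /backup/storage/

# Later runs: tar the whole storage; listing heads.db again appends a
# second copy, read at the end of the run, to the tar stream.
tar cf /backup/storage.tar "$STORAGE" "$STORAGE/heads.db"

# Fourth run: additionally touch heads.db (on the restored copy) so its
# mtime is newer than the series files, short-circuiting crash recovery.
touch /backup/restored/heads.db
```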
As discussed with @beorn7, to support consistent binary backups Prometheus needs to offer 3 actions:
These can be API endpoints. With these, a backup procedure would be:
Optionally, Prometheus might offer an endpoint to "just" download a tarball, where it performs these steps behind the scenes.
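Purely as an illustration of how that could look: every endpoint path below is hypothetical (no Prometheus release exposes them), and the exact set of actions is my reading of the discussion above.

```bash
# Hypothetical endpoints sketching the proposed procedure.
curl -XPOST http://localhost:9090/-/suspend-persistence   # stop persisting/truncating series
curl -XPOST http://localhost:9090/-/checkpoint            # write a fresh heads.db checkpoint
tar cf backup.tar -C /var/lib prometheus                  # archive the now-consistent storage
curl -XPOST http://localhost:9090/-/resume-persistence    # resume normal operation
```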