This repository has been archived by the owner on Aug 23, 2023. It is now read-only.
Because metrics are buffered in memory as "chunks" and only periodically flushed to Cassandra, restarting the metrictank process results in data loss.
Though we plan to solve this by replaying data from Kafka, I think it would be great if we could provide a generic solution that works for all input options, specifically carbon ingestion. This would allow single-server installations to survive crashes/restarts without data loss, or at least limit the loss to a few seconds.
What I propose is writing metrics to disk in an append-only log, with one file per chunk window (aka chunkspan). In addition to the log, we keep an on-disk index of all series seen during the chunk window, with a flag per series indicating whether its chunk has been saved.
A background task would "compact" the logs once we are halfway through the next chunk window. Compaction starts by reading the index to identify the unsaved chunks. If there are no unsaved chunks, the whole log can be deleted (the ideal case); otherwise it is copied to a new file with the saved series excluded.
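That compaction step could look roughly like the following self-contained sketch (again with hypothetical file names and a JSON-lines format, not metrictank's):

```python
import json
import os
import tempfile

WAL_DIR = tempfile.mkdtemp()

def compact(window):
    """Delete the window's log if every series was saved; otherwise
    rewrite it with the saved series filtered out."""
    log = os.path.join(WAL_DIR, f"{window}.log")
    idx_path = os.path.join(WAL_DIR, f"{window}.idx")
    with open(idx_path) as f:
        saved = json.load(f)          # series name -> saved flag
    if all(saved.values()):
        os.remove(log)                # ideal case: nothing left to replay
        os.remove(idx_path)
        return
    tmp = log + ".tmp"
    with open(log) as src, open(tmp, "w") as dst:
        for line in src:
            if not saved[json.loads(line)["series"]]:
                dst.write(line)       # keep only unsaved series
    os.replace(tmp, log)              # atomic swap, never a half-written log

# demo: one saved series, one unsaved
with open(os.path.join(WAL_DIR, "1200.log"), "w") as f:
    f.write(json.dumps({"series": "cpu.user", "ts": 1200, "value": 0.5}) + "\n")
    f.write(json.dumps({"series": "cpu.sys", "ts": 1205, "value": 0.1}) + "\n")
with open(os.path.join(WAL_DIR, "1200.idx"), "w") as f:
    json.dump({"cpu.user": True, "cpu.sys": False}, f)
compact(1200)
with open(os.path.join(WAL_DIR, "1200.log")) as f:
    print(f.read().strip())  # only the cpu.sys point survives
```

Writing to a temp file and swapping it in with `os.replace` means a crash mid-compaction leaves the original log intact rather than a truncated one.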
After a restart we would then have at most 1.5 * chunkspan worth of data to replay, by streaming from disk and discarding series whose chunks we know were already saved.
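The replay step on startup could then be sketched as follows (same hypothetical layout as above; in practice you would iterate over every surviving window log, oldest first):

```python
import json
import os
import tempfile

WAL_DIR = tempfile.mkdtemp()

def replay(window):
    """Stream one window's log from disk, yielding only points whose
    series were not already flushed before the restart."""
    with open(os.path.join(WAL_DIR, f"{window}.idx")) as f:
        saved = json.load(f)
    with open(os.path.join(WAL_DIR, f"{window}.log")) as log:
        for line in log:
            point = json.loads(line)
            if not saved.get(point["series"], False):
                yield point  # only unsaved series re-enter memory

# demo: replay a window where cpu.user was saved but cpu.sys was not
with open(os.path.join(WAL_DIR, "1200.log"), "w") as f:
    f.write(json.dumps({"series": "cpu.user", "ts": 1200, "value": 0.5}) + "\n")
    f.write(json.dumps({"series": "cpu.sys", "ts": 1205, "value": 0.1}) + "\n")
with open(os.path.join(WAL_DIR, "1200.idx"), "w") as f:
    json.dump({"cpu.user": True, "cpu.sys": False}, f)
replayed = list(replay(1200))
print([p["series"] for p in replayed])  # → ['cpu.sys']
```

Since compaction only runs halfway through the following window, at most the current window plus half the previous one can be on disk, which is where the 1.5 * chunkspan bound comes from.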
Seems useful, but also quite an undertaking.
I see two solutions to the restarts-without-data-loss-when-not-using-kafka problem: one is a WAL like you describe; the other is simply to instruct people to spin up another instance, wait however long is needed, and then switch the primary role, at which point you can stop the old one.
It's arguably a bit more complicated, but then again, metrictank is also not aimed at "small-scale" installs.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.