Standalone backfill tool #1984

shanson7 · 2021-06-09T08:11:48Z

We have been using this tool to (manually) backfill time-series data from users. It is non-trivial and a little finicky.

The time-series must already exist in the Metrictank index (this tool purposefully does not add them).
It needs its own Kafka instance (we just create a temporary one in k8s and destroy it when complete).
It is specific to the Cassandra store.

While this tool is functional for our purposes, I'm not quite sure whether it's ready for upstreaming, or what changes it might need to get there.

Dieterbe · 2021-07-07T15:44:31Z

Can you explain what problem this tool solves?
I presume that the reason for this tool's existence is to simply work around metrictank's limitation of enforcing that data is newer than the last seen point (for the series it has in index anyway)
We do something similar in the mt-whisper-importer-writer also (which doesn't have anything to do with whisper anymore)
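To make the limitation concrete, here is a minimal sketch of an "only accept points newer than the last seen one" check. It is a toy model, not Metrictank's actual code: the `SeriesBuffer` class, its method names, and the choice to also reject equal timestamps are all assumptions made for illustration.

```python
class SeriesBuffer:
    """Toy per-series buffer (hypothetical, not Metrictank code).

    It accepts a point only if its timestamp is newer than the last
    accepted timestamp for the same series.
    """

    def __init__(self):
        self.last_ts = {}  # series id -> timestamp of the last accepted point
        self.points = {}   # series id -> list of (timestamp, value)

    def add(self, series_id, ts, value):
        last = self.last_ts.get(series_id)
        # Assumption: equal timestamps are rejected as well as older ones.
        if last is not None and ts <= last:
            return False
        self.last_ts[series_id] = ts
        self.points.setdefault(series_id, []).append((ts, value))
        return True


# An instance that has already seen recent data refuses older points...
live = SeriesBuffer()
accepted_live = live.add("some.series", 1_625_000_000, 1.0)
rejected_old = live.add("some.series", 1_600_000_000, 0.5)

# ...while a fresh instance with no state accepts the same old point.
fresh = SeriesBuffer()
accepted_old = fresh.add("some.series", 1_600_000_000, 0.5)
```

In this model the old point is only accepted by the instance that has no prior state for the series, which is the situation a separate, out-of-band process puts itself in.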

shanson7 · 2021-07-07T16:17:02Z

> I presume that the reason for this tool's existence is to simply work around metrictank's limitation of enforcing that data is newer than the last seen point (for the series it has in index anyway)

Precisely. It is used as part of an out of band process to selectively fill in older data (primarily from legacy systems) for which more recent data already exists in Metrictank.

Dieterbe · 2021-07-07T16:25:38Z

I'm wondering how we can avoid code duplication, since we already have mt-whisper-importer-writer which listens on http for chunk write requests. If the client can accommodate a slow or temporary down backend, then it can simply wait and retry the http posts (this is what mt-whisper-importer-reader does), and you don't need a queue in between.
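The "wait and retry the http posts" approach could look roughly like the sketch below. This is not taken from mt-whisper-importer-reader: the function name `post_with_retry`, its parameters, the backoff policy, and the URL in the usage comment are hypothetical and only illustrate how a client can tolerate a slow or temporarily unavailable backend without a queue.

```python
import time
import urllib.error
import urllib.request


def post_with_retry(url, payload, attempts=5, base_delay=1.0,
                    opener=urllib.request.urlopen, sleep=time.sleep):
    """POST payload (bytes) to url, retrying with exponential backoff.

    Hypothetical helper: `opener` and `sleep` are injectable so the retry
    logic can be exercised without a network or real waiting.
    """
    delay = base_delay
    for attempt in range(1, attempts + 1):
        req = urllib.request.Request(url, data=payload, method="POST")
        try:
            with opener(req, timeout=10) as resp:
                return resp.status
        except OSError:
            # URLError, HTTPError, timeouts and connection errors are all
            # OSError subclasses; give up only after the final attempt.
            if attempt == attempts:
                raise
            sleep(delay)
            delay *= 2


# Usage (the URL is a placeholder, not a real endpoint):
# post_with_retry("http://writer.example:8080/some/path", chunk_bytes)
```

A real client would probably want to retry only on errors that indicate the backend is down or overloaded, and fail fast on requests the server rejects as invalid.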

shanson7 · 2021-07-09T11:08:50Z

> I'm wondering how we can avoid code duplication, since we already have mt-whisper-importer-writer which listens on http for chunk write requests. If the client can accommodate a slow or temporary down backend, then it can simply wait and retry the http posts (this is what mt-whisper-importer-reader does), and you don't need a queue in between.

Not sure what you are implying here. Do you mean kafka when you say "a queue in between"? In our case we have a custom task that knows how to take some input and create the datapoints that MT expects. It does not know how to create the chunks (and I don't believe it should). So I still think we need some tool to create the chunks, which honestly is a large part of the work here.

We use this tool out-of-band. We just spin up a pod with 3 containers (our custom process, kafka, and mt-backfill) and feed in the data to backfill. Realistically, it's only a few hundred lines of code, and reuses a lot of the primary MT code. If you see some more to factor out to reduce the code footprint, I'd be open to that. The reason I want to upstream this is so it stays current if, for example, an interface changes.

Dieterbe · 2021-07-13T11:55:07Z

> Do you mean kafka when you say "a queue in between"?

yes

> It does not know how to create the chunks (and I don't believe it should). So I still think we need some tool to create the chunks, which honestly is a large part of the work here.

I see.
It seems then that for you, writing data into kafka is low/zero cost as you probably use the same code you already had.
If we want to make this a more generally useful tool, then we should consider that for others, the tooling that yields old metrics may not as easily reuse the same kafka producer code that is being used for the realtime metrics (although, something like carbon-relay-ng can be useful here too, it can take in data over carbon tcp and write to kafka, or mt-gateway can do the same with an http input)

I was thinking of ways to integrate this more into the existing tools rather than introduce yet another one-off. Some ideas:

1. Could we use MT itself instead of mt-backfill? To be clear, is your goal to "initiate a backfill" (go back in time once, for each metric that is being backfilled), but from then on send series in order, and not go back in time multiple times (e.g. send data out of order during the backfill)? If so, running a fresh/separate metrictank should also bypass that limitation (and you could also run it with memory index), or am I missing something?
2. Move the chunk building code into mt-whisper-importer-writer. You can then send raw points to it and don't "need" kafka, e.g. we could use carbon tcp or the same http endpoint that mt-gateway uses. But I don't like this, as it wouldn't save us much.

Dieterbe · 2021-07-28T13:14:52Z

@shanson7 what do you think of option 1 above? am i missing something?

shanson7 · 2021-08-11T15:30:14Z

That is essentially what this tool is doing, but stripping out bits that aren't needed and automatically shutting down when all data is processed. It's intended for automated runs.

Dieterbe · 2021-08-12T09:09:12Z

OK then that's fine. I think we mainly need docs then:

this can just be a preamble in the --help output (that way it'll also go into the docs automatically) explaining what the tool is for, how/when to use it, what the caveats and tricky bits are, etc
the doc should also mention that it only supports cassandra, and not bigtable.
may want to rename the tool to mt-backfill-experimental, i'll leave that up to you, how much trust you have in it. I personally don't plan to test this tool. (I note mt-store-cp-experimental is also still experimental)

also remove the "WIP" from the title since it seems it's no longer a wip :)

Dieterbe · 2021-08-30T10:10:47Z

@shanson7 just wanted to confirm you're not waiting on anything from me for this one. this is close to finished. see above.

shanson7 force-pushed the backfill_20210416 branch from a57819b to d15a8bd June 9, 2021 11:39

shanson7 added 6 commits August 31, 2021 11:36

Add backfill tool

3d443bc

Fix/add logging

356e062

ignore mt-backfill

2be15d5

Add more usage details

e80d6a3

Update interface

eade2bb

Update backfill in docs/tools

4530ebc

shanson7 force-pushed the backfill_20210416 branch from d15a8bd to 4530ebc August 31, 2021 10:45

shanson7 changed the title ~~WIP - standalone backfill tool~~ Standalone backfill tool Aug 31, 2021

Sort .gitignore

bb2e8fc

Dieterbe approved these changes Aug 31, 2021

Dieterbe merged commit 6c13872 into grafana:master Aug 31, 2021

shanson7 deleted the backfill_20210416 branch August 31, 2021 16:38