feat: Add support for non-DB state backends (`s3`, `dynamodb`, etc.) #5981

aaronsteers · 2022-06-02T16:24:07Z

Spec discussion:

Spec discussion: Config layer and CLI interface for alternate state backends #6270

I think we could introduce state backends such as s3, dynamodb, and other backends that have better reliability than an RDBMS and near-zero always-on cost.

In the future, this could also lead to a Meltano-managed state offering, similar to Pulumi's default experience.

Originally posted by @aaronsteers in #2520 (comment)

Additional context:

Currently, the 'current' STATE is a composite built by an on-demand scan through history records. This is not ideal in general, and we have #3340 (Create a new state table in Meltano systemdb) logged to refactor this so that a single table row would be the "backend" to read from and write to.
Moving to a generic backend store would likely also require solving #3340, or at least the same refactoring would (likely) be needed in both cases to eliminate the need for scanning history logs.

Workarounds:

#2520 talks about a potential workaround, but basically the current workaround is to:

Run meltano state get ... to pull the latest state into a file.
Upload that file to S3 before the container is deleted.
On the next run, download the file from S3 and load it into the systemdb with meltano state set ...

This works with or without a Postgres or other long-lived RDBMS, since the built-in SQLite implementation is created on the fly if no Postgres DB is specified, and the process above essentially just removes the long-term state storage requirement from the SQLite backend.
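The three steps above can be sketched as a small wrapper script. This is only an illustration, not part of Meltano: the function names and the `run` parameter are hypothetical, it assumes the `aws` CLI is installed, and it assumes `meltano state set` accepts `--force` and `--input-file` (check `meltano state set --help` for your version).

```python
import subprocess
from pathlib import Path


def save_state(state_id, s3_uri, workdir, run=subprocess.run):
    """Dump the latest state for state_id to a file and copy it to S3."""
    # State IDs contain ':' (e.g. dev:tap-x-to-target-y), so make a safe filename.
    state_file = Path(workdir) / f"{state_id.replace(':', '_')}.json"
    result = run(
        ["meltano", "state", "get", state_id],
        check=True, capture_output=True, text=True,
    )
    state_file.write_text(result.stdout)
    run(["aws", "s3", "cp", str(state_file), s3_uri], check=True)
    return state_file


def restore_state(state_id, s3_uri, workdir, run=subprocess.run):
    """Download a saved state file from S3 and load it into the systemdb."""
    state_file = Path(workdir) / f"{state_id.replace(':', '_')}.json"
    run(["aws", "s3", "cp", s3_uri, str(state_file)], check=True)
    # Assumption: --force skips the confirmation prompt in non-interactive runs.
    run(
        ["meltano", "state", "set", "--force", state_id,
         "--input-file", str(state_file)],
        check=True,
    )
    return state_file
```

In a container entrypoint, `restore_state(...)` would run before the pipeline and `save_state(...)` after it. The `run` argument exists only so the commands can be swapped out when testing.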


tayloramurphy · 2022-07-13T21:25:13Z

The plugin approach discussed in #6270 (comment) I think does make sense to explore more. Particularly in light of efforts with #6130

rickiesmooth · 2022-07-14T15:30:53Z

I'm running Meltano in a container as well, but I'm having a hard time figuring out the best way to upload that file to S3 before the container is deleted. There's no on-run-end hook in Meltano, right?

tayloramurphy · 2022-07-14T20:30:21Z

@rickiesmooth which file are you asking about? There are a few ways I could imagine doing this, but there's currently no on-run-end hook exactly like there is in dbt.

rickiesmooth · 2022-07-15T07:41:23Z

Sorry I was referencing the workaround and how it uploads the state file.

tayloramurphy · 2022-07-15T13:15:19Z

@rickiesmooth ah ok - I see now. You could have a tap and target that does that extract for you - it's possible there's already one for reading from a SQLite DB as well.

rickiesmooth · 2022-07-15T13:46:37Z

ah that would be nice, for now I just do:

meltano state get dev:tap-google-search-console-to-target-bigquery > meltano_state/gsc_state.json
meltano state get dev:tap-google-analytics-to-target-bigquery > meltano_state/ga_state.json

aws s3 sync meltano_state s3://$S3_BUCKET/meltano_state

after Meltano has run.

tayloramurphy · 2022-07-15T17:47:17Z

@rickiesmooth that's good to know! I think we will be building this into Meltano natively though as it's going to be a pre-req for our future Managed offering.

cjohnhanson · 2022-08-11T21:24:56Z

There have been some synchronous discussions on this--documenting the results of those here.

The v1 of this feature is going to be to use a third-party library (likely smart_open) that will allow us to support users configuring state backends in the form of a simple URI, e.g. s3://some_bucket/some_prefix, where both partial and complete state files can be written. This means we'll also need to implement some barebones locking mechanism. Locking doesn't need to be too sophisticated at first because running the same pipeline using the same backend concurrently in separate deployments should be a pretty rare use case, and those users can still use the existing system db state backend to get more deterministic behavior. Plus, the meltano state commands allow users to manually edit, clear, copy, or merge state to fix any issues that arise from concurrent runs.
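To make the "barebones locking" idea concrete, here is a minimal sketch that uses a lock file on a local filesystem. It is an assumption-laden stand-in, not the planned implementation: `state_lock` and its parameters are hypothetical names, and an object store may not offer the same atomic create-if-absent guarantee that `O_EXCL` gives locally.

```python
import os
import time
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def state_lock(state_dir, state_id, timeout=5.0, poll=0.1):
    """Hold an exclusive lock for one state ID while reading or writing it."""
    lock_path = Path(state_dir) / f"{state_id}.lock"
    deadline = time.monotonic() + timeout
    while True:
        try:
            # O_EXCL makes creation fail if the lock file already exists.
            fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
            break
        except FileExistsError:
            if time.monotonic() >= deadline:
                raise TimeoutError(f"could not lock state {state_id!r}")
            time.sleep(poll)
    try:
        os.close(fd)
        yield lock_path
    finally:
        # Release the lock even if the caller's block raised.
        lock_path.unlink()
```

A writer would wrap its read-modify-write of the state file in `with state_lock(...)`. Stale locks left behind by a crashed process are not handled here; that is the kind of per-backend policy the thread leaves open.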

Creating state backends is going to require decoupling state from job history, so we'll need to tackle #3340 before getting started on the actual state backend implementation.

This implementation will be done in such a way as to lay the foundation for user contributed state backend plugins in some future iteration, but "pluggable" backends are out of scope for the time being. The URI approach solves for a huge number of use cases and takes us one step closer to eliminating the need for a postgres backend in production deployments without the heavy lift of supporting arbitrary plugins.

cjohnhanson · 2022-09-01T20:17:33Z

As I've been getting into the weeds on this, it's become increasingly clear that the best approach is much more entangled with #3340 than initially thought, and a lot of this work will be front-loaded into that PR.

I've also realized that I've spent a lot of time heads down on this without any PRs or touch points so here's a quick brain dump of the current status and timeline along with a more thorough implementation spec to make sure everyone is on the same page about what's being delivered here.

Status and Timeline

Current status is that the refactoring work for this has turned out to touch a lot of the codebase and existing testing suite. Since we're refactoring jobs and state as part of #3340, the PR for that issue will include a basic v1 for the state backend approach, but using systemdb as the first supported state backend. #3340's PR will have rewritten the way we manage state entirely to be basically the same pattern as we use for managing settings. We're rewriting StateService methods to be backend-agnostic and then we'll have a StateStore ABC that state backend implementations will inherit, very similar to how we currently have a unified ProjectSettingsService and PluginSettingsService that mostly wrap methods that are implemented at the SettingsStoreManager level.

Since this is a significant refactor and major new feature, it's difficult to chunk up into small iterations. Current timeline for releasing StateBackends will likely be Iteration 16. It should be ready to go in the first release after that and I'll have a demo ready to go as well.

Implementation Spec

StateService

In rolling out the meltano state command, we refactored state management into its own StateService class. This means that state management logic is already consolidated into a single place in the codebase, and we can easily move the SQLAlchemy-heavy logic from StateService methods into a dedicated systemdb state backend. StateService will then simply contain logic for determining which state backend to use and calling the relevant state backend methods. Again, this is very similar to the approach we take for settings management via SettingsService and SettingsStoreManager implementations.

StateStoreManager

We're writing a new StateStoreManager base class which defines abstract methods for getting and setting state (both partial and complete) for various state backends, as well as extremely simple locking logic (basically just acquire and release--implementations will determine timeout logic, deadlock resolution, etc. on a per-backend basis). The first implementation of StateStoreManager is DBStateStoreManager, which will largely consist of the existing logic for managing state in the backend DB as currently defined in StateService.
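As a rough illustration of the split described above, the sketch below pairs an abstract base class with a thin service that delegates to it. Only the class names StateService and StateStoreManager come from this thread; the method names, signatures, the partial/complete record layout, and the in-memory backend are all assumptions made for the example.

```python
from abc import ABC, abstractmethod
from contextlib import contextmanager


class StateStoreManager(ABC):
    """Hypothetical base class: one subclass per state backend."""

    @abstractmethod
    def get(self, state_id): ...

    @abstractmethod
    def set(self, state_id, state, complete=True): ...

    @abstractmethod
    def acquire_lock(self, state_id): ...

    @abstractmethod
    def release_lock(self, state_id): ...

    @contextmanager
    def lock(self, state_id):
        """Acquire the backend lock and always release it afterwards."""
        self.acquire_lock(state_id)
        try:
            yield
        finally:
            self.release_lock(state_id)


class InMemoryStateStoreManager(StateStoreManager):
    """Toy backend used only to exercise the interface."""

    def __init__(self):
        self._states = {}
        self._locked = set()

    def get(self, state_id):
        return self._states.get(state_id, {"partial": {}, "complete": {}})

    def set(self, state_id, state, complete=True):
        record = self._states.setdefault(state_id, {"partial": {}, "complete": {}})
        record["complete" if complete else "partial"] = state

    def acquire_lock(self, state_id):
        if state_id in self._locked:
            raise RuntimeError(f"state {state_id!r} is already locked")
        self._locked.add(state_id)

    def release_lock(self, state_id):
        self._locked.discard(state_id)


class StateService:
    """Backend-agnostic facade: holds a backend and delegates to it."""

    def __init__(self, backend):
        self.backend = backend

    def set_state(self, state_id, state, complete=True):
        with self.backend.lock(state_id):
            self.backend.set(state_id, state, complete=complete)

    def get_state(self, state_id):
        with self.backend.lock(state_id):
            return self.backend.get(state_id)
```

With this shape, a DBStateStoreManager would hold the SQLAlchemy logic, and StateService would not need to know which backend it was given.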

Configuring

In this v1, the state backend will be configurable at the project level and will be a top-level key in meltano.yml. This is the simplest approach and meets most use cases that have been discussed as motivation for the state backends feature. However, the implementation outlined above will make it relatively easy to quickly update state backends to be configurable at any configuration layer--StateService can simply do SettingsService.get calls to determine what StateStoreManager to use and pass through any necessary config. Then it won't be much effort to implement a PluginStateStoreManager that can wrap Meltano StateBackend extensions when we're ready to add that capability to the EDK.
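For illustration, resolving a backend from project config could look roughly like the sketch below. The key name `state_backend`, the registry, and every manager name other than DBStateStoreManager are hypothetical; the thread only says that a top-level key in meltano.yml will hold the setting and that backends are addressed by URI.

```python
from urllib.parse import urlparse

# Hypothetical registry mapping a URI scheme to a manager name.
BACKENDS_BY_SCHEME = {
    "systemdb": "DBStateStoreManager",
    "s3": "S3StateStoreManager",
    "file": "LocalFilesystemStateStoreManager",
}


def resolve_state_backend(project_config):
    """Return (manager_name, uri) for a dict loaded from meltano.yml."""
    # Assumption: the existing system database stays the default.
    uri = project_config.get("state_backend", "systemdb")
    # A bare value such as "systemdb" has no scheme, so fall back to the value itself.
    scheme = urlparse(uri).scheme or uri
    try:
        return BACKENDS_BY_SCHEME[scheme], uri
    except KeyError:
        raise ValueError(f"unsupported state backend: {uri!r}") from None
```

For example, `resolve_state_backend({"state_backend": "s3://some_bucket/some_prefix"})` would select the hypothetical S3 manager and pass the URI through for it to interpret.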

FYI @aaronsteers @tayloramurphy @pandemicsyn

tayloramurphy · 2022-09-01T21:25:02Z

@cjohnhanson appreciate the write-up and context setting 👍

aaronsteers · 2022-09-02T00:49:57Z

> Since we're refactoring jobs and state as part of #3340, the PR for that issue will include a basic v1 for the state backend approach, but using systemdb as the first supported state backend.

👍 🎯

> #3340's PR will have rewritten the way we manage state entirely to be basically the same pattern as we use for managing settings. We're rewriting StateService methods to be backend-agnostic and then we'll have a StateStore ABC that state backend implementations will inherit, very similar to how we currently have a unified ProjectSettingsService and PluginSettingsService that mostly wrap methods that are implemented at the SettingsStoreManager level.

👍

> Since this is a significant refactor and major new feature, it's difficult to chunk up into small iterations.

Emphasis on 'small' iterations, yes?

Small or not, does this seem like the right approach to breaking this into chunks?

First PR. Launch the first state backend, which would be systemdb and the new state table in #3340. Doesn't need configuration. (Would track/rework in #3340 as a dependency for external backends discussed here.)
Second PR. Add a second state backend, for instance based on PyFilesystem or smart_open. Should include a method of configuring. (Can use this issue, #5981.)
Third PR. Add --from_backend and --to_backend support to the meltano state copy|move CLI commands. (Not logged yet.)

Do I have this right @cjohnhanson ?

cjohnhanson · 2022-09-02T13:32:07Z

@aaronsteers --

> First PR. Launch the first state backend, which would be systemdb and the new state table in #3340. Doesn't need configuration. (Would track/rework in #3340 as a dependency for external backends discussed here.)
>
> Second PR. Add a second state backend, for instance based on PyFilesystem or smart_open. Should include a method of configuring. (Can use this issue, #5981.)

100%, that's exactly the plan.

> Third PR. Add --from_backend and --to_backend support to the meltano state copy|move CLI commands. (Not logged yet.)

Yeah, I think that makes sense as the next step and should be pretty straightforward to implement after these changes.

aaronsteers added the Accepting Pull Requests label Jun 2, 2022

aaronsteers mentioned this issue Jun 2, 2022

Document how to manage incremental replication state without a persistent system database #2520

Closed

tayloramurphy added kind/Feature valuestream/Meltano roadmap labels Jun 23, 2022

labelsync-manager bot added the kind/Feature label Jun 23, 2022

tayloramurphy removed the kind/Feature label Jun 24, 2022

tayloramurphy mentioned this issue Jun 24, 2022

Refresh catalog on every invoke (fresh_catalog: true) #2848

Closed

aaronsteers mentioned this issue Jul 26, 2022

Support pluggable Settings Backends and Secrets Backends #2859

Closed

tayloramurphy changed the title ~~Introduce state backends such as s3, dynamodb, etc.~~ feat: Introduce state backends such as s3, dynamodb, etc. Aug 3, 2022

aaronsteers assigned cjohnhanson Aug 9, 2022

aaronsteers mentioned this issue Aug 10, 2022

Create a new state table in Meltano systemdb #3340

Closed

aaronsteers added this to the Eliminate `systemdb` RDBMS reliance milestone Aug 27, 2022

aaronsteers changed the title ~~feat: Introduce state backends such as s3, dynamodb, etc.~~ feat: Add support for non-DB state backends (s3, dynamodb, etc.) Sep 2, 2022

cjohnhanson mentioned this issue Sep 13, 2022

chore: Support "state backends" beginning with systemdb #6742

Merged

cjohnhanson mentioned this issue Nov 8, 2022

feat: remote state backends #6911

Merged

cjohnhanson closed this as completed in #6911 Nov 11, 2022
