feat: Add support for non-DB state backends (s3, dynamodb, etc.) #5981

Closed
aaronsteers opened this issue Jun 2, 2022 · 12 comments · Fixed by #6911

Comments

@aaronsteers
Contributor

aaronsteers commented Jun 2, 2022

Spec discussion:

I think we could introduce state backends such as s3, dynamodb, and other backends that have better reliability than an RDBMS and near-zero always-on cost.

In the future, this could enable a Meltano-managed state offering, similar to Pulumi's default experience.

Originally posted by @aaronsteers in #2520 (comment)

Additional context:

  1. Currently, a 'current' STATE is a composite built by an on-demand scan through history records. This is not ideal in general, and we have "Create a new state table in Meltano systemdb" (#3340) logged to refactor this so that a single table row would be the "backend" to read from and write to.
  2. Moving to a generic backend store would likely also require solving #3340, or at least the same refactoring would (likely) be needed in both cases to eliminate the need for scanning history logs.

Workarounds:

#2520 talks about a potential workaround, but basically the current workaround is to:

  1. Run meltano state get ... to pull the latest state into a file.
  2. Upload that file to S3 before the container is deleted.
  3. On the next run, download the file from S3 and load it into the systemdb with meltano state set ...

This works with or without a Postgres or other long-lived RDBMS, since the built-in SQLite implementation is created on the fly if no Postgres DB is specified, and the process above essentially just removes the long-term state storage requirement from the SQLite backend.
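
The three steps above can be sketched in Python roughly as follows. This is a minimal sketch under assumptions, not an official Meltano recipe: the helper names are hypothetical, the `--force` and `--input-file` options of `meltano state set` should be checked against the installed Meltano version, and the AWS CLI is assumed to be installed and configured in the container.

```python
import subprocess
from pathlib import Path


def get_state_cmd(state_id: str) -> list[str]:
    # Step 1: `meltano state get` prints the latest state JSON to stdout.
    return ["meltano", "state", "get", state_id]


def set_state_cmd(state_id: str, state_file: Path) -> list[str]:
    # Step 3: load a state file back into the system DB.
    # ASSUMPTION: `--force` (skip the confirmation prompt) and `--input-file`
    # are available in the installed Meltano version.
    return ["meltano", "state", "set", "--force", state_id,
            "--input-file", str(state_file)]


def s3_copy_cmd(src: str, dst: str) -> list[str]:
    # Step 2 (and its reverse): copy a single file to or from S3.
    return ["aws", "s3", "cp", src, dst]


def save_state(state_id: str, state_file: Path, s3_uri: str,
               run=subprocess.run) -> None:
    """Dump the current state to a local file, then upload it to S3."""
    result = run(get_state_cmd(state_id), check=True,
                 capture_output=True, text=True)
    state_file.write_text(result.stdout)
    run(s3_copy_cmd(str(state_file), s3_uri), check=True)


def restore_state(state_id: str, state_file: Path, s3_uri: str,
                  run=subprocess.run) -> None:
    """Download the saved state from S3, then load it into the system DB."""
    run(s3_copy_cmd(s3_uri, str(state_file)), check=True)
    run(set_state_cmd(state_id, state_file), check=True)
```

A container entrypoint would call `restore_state(...)` before the pipeline and `save_state(...)` after it. On the very first run there is no object in S3 yet, so the download fails and the restore step would need to be skipped.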

@tayloramurphy
Collaborator

I think the plugin approach discussed in #6270 (comment) does make sense to explore more, particularly in light of efforts with #6130.

@rickiesmooth

I'm running Meltano in a container as well, but I'm having a hard time figuring out the best way to upload that file to S3 before the container is deleted. There's no on-run-end hook in Meltano, right?

@tayloramurphy
Collaborator

@rickiesmooth which file are you asking about? There are a few ways I could imagine doing this, but there's currently no on-run-end hook exactly like there is in dbt.

@rickiesmooth

Sorry I was referencing the workaround and how it uploads the state file.

@tayloramurphy
Collaborator

@rickiesmooth ah ok - I see now. You could have a tap and target that does that extract for you - it's possible there's already one for reading from a SQLite DB as well.

@rickiesmooth

ah that would be nice, for now I just do:

meltano state get dev:tap-google-search-console-to-target-bigquery > meltano_state/gsc_state.json
meltano state get dev:tap-google-analytics-to-target-bigquery > meltano_state/ga_state.json

aws s3 sync meltano_state s3://$S3_BUCKET/meltano_state

after meltano has run

@tayloramurphy
Collaborator

@rickiesmooth that's good to know! I think we will be building this into Meltano natively though as it's going to be a pre-req for our future Managed offering.

@tayloramurphy changed the title from "Introduce state backends such as s3, dynamodb, etc." to "feat: Introduce state backends such as s3, dynamodb, etc." on Aug 3, 2022

@cjohnhanson
Contributor

There have been some synchronous discussions on this--documenting the results of those here.

The v1 of this feature is going to be to use a third-party library (likely smart_open) that will allow us to support users configuring state backends in the form of a simple URI, e.g. s3://some_bucket/some_prefix, where both partial and complete state files can be written. This means we'll also need to implement some barebones locking mechanism. Locking doesn't need to be too sophisticated at first because running the same pipeline using the same backend concurrently in separate deployments should be a pretty rare use case, and those users can still use the existing system db state backend to get more deterministic behavior. Plus, the meltano state commands allow users to manually edit, clear, copy, or merge state to fix any issues that arise from concurrent runs.
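
To make the URI and locking ideas above concrete, here is a minimal sketch using only the Python standard library. It is not the planned implementation: `parse_backend_uri` and `FileLock` are hypothetical names, and the lock relies on atomic exclusive file creation on a local filesystem, which an object store such as S3 may not offer in the same form.

```python
import os
import time
from pathlib import Path
from urllib.parse import urlparse


def parse_backend_uri(uri: str) -> tuple[str, str, str]:
    """Split a backend URI into (scheme, bucket or host, key prefix)."""
    parsed = urlparse(uri)
    return parsed.scheme, parsed.netloc, parsed.path.lstrip("/")


class FileLock:
    """Barebones lock: whoever creates the lock file first holds the lock."""

    def __init__(self, lock_path: Path, timeout: float = 5.0,
                 poll: float = 0.05) -> None:
        self.lock_path = lock_path
        self.timeout = timeout
        self.poll = poll

    def acquire(self) -> None:
        deadline = time.monotonic() + self.timeout
        while True:
            try:
                # O_EXCL makes creation fail if the file already exists.
                fd = os.open(self.lock_path,
                             os.O_CREAT | os.O_EXCL | os.O_WRONLY)
                os.close(fd)
                return
            except FileExistsError:
                if time.monotonic() >= deadline:
                    raise TimeoutError(f"could not lock {self.lock_path}")
                time.sleep(self.poll)

    def release(self) -> None:
        self.lock_path.unlink(missing_ok=True)

    def __enter__(self) -> "FileLock":
        self.acquire()
        return self

    def __exit__(self, *exc_info) -> None:
        self.release()
```

A writer would wrap each state update in `with FileLock(path): ...`. A lock file left behind by a crashed run blocks later runs until it is removed by hand, which is the kind of issue the manual `meltano state` commands are meant to help recover from.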

Creating state backends is going to require decoupling state from job history, so we'll need to tackle #3340 before getting started on the actual state backend implementation.

This implementation will be done in such a way as to lay the foundation for user contributed state backend plugins in some future iteration, but "pluggable" backends are out of scope for the time being. The URI approach solves for a huge number of use cases and takes us one step closer to eliminating the need for a postgres backend in production deployments without the heavy lift of supporting arbitrary plugins.

@cjohnhanson
Contributor

As I've been getting into the weeds on this, it's increasingly become clear that the best approach to this is much more entangled with #3340 than initially thought and a lot of this work will be front-loaded into that PR.

I've also realized that I've spent a lot of time heads down on this without any PRs or touch points so here's a quick brain dump of the current status and timeline along with a more thorough implementation spec to make sure everyone is on the same page about what's being delivered here.

Status and Timeline

Current status is that the refactoring work for this has turned out to touch a lot of the codebase and existing testing suite. Since we're refactoring jobs and state as part of #3340, the PR for that issue will include a basic v1 for the state backend approach, but using systemdb as the first supported state backend. #3340's PR will have rewritten the way we manage state entirely to be basically the same pattern as we use for managing settings. We're rewriting StateService methods to be backend-agnostic and then we'll have a StateStore ABC that state backend implementations will inherit, very similar to how we currently have a unified ProjectSettingsService and PluginSettingsService that mostly wrap methods that are implemented at the SettingsStoreManager level.

Since this is a significant refactor and major new feature, it's difficult to chunk up into small iterations. Current timeline for releasing StateBackends will likely be Iteration 16. It should be ready to go in the first release after that and I'll have a demo ready to go as well.

Implementation Spec

StateService

In rolling out the meltano state command, we refactored state management into its own StateService class. This means that state management logic is already consolidated into a single place in the codebase, and we can easily move the SQLAlchemy-heavy logic from StateService methods into a dedicated systemdb state backend. StateService will then simply contain logic for determining which state backend to use and for calling the relevant state backend methods. Again, this is very similar to the approach we take for settings management via SettingsService and SettingsStoreManager implementations.

StateStoreManager

We're writing a new StateStoreManager base class which defines abstract methods for getting and setting state (both partial and complete) for various state backends, as well as extremely simple locking logic (basically just acquire and release--implementations will determine timeout logic, deadlock resolution, etc. on a per-backend basis). The first implementation of StateStoreManager is DBStateStoreManager, which will largely consist of the existing logic for managing state in the backend DB as currently defined in StateService.
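
As a rough illustration of the shape described above, the sketch below defines an abstract base class with get, set, acquire and release, plus a toy in-memory implementation. Only the name StateStoreManager comes from this comment; the method names, signatures, and the in-memory class are assumptions made for the example and do not reflect the actual Meltano code.

```python
from abc import ABC, abstractmethod


class StateStoreManager(ABC):
    """Hypothetical interface that every state backend would implement."""

    @abstractmethod
    def get(self, state_id: str) -> dict:
        """Return the stored partial and complete state for a state ID."""

    @abstractmethod
    def set(self, state_id: str, state: dict, complete: bool) -> None:
        """Store complete state, or partial state when complete is False."""

    @abstractmethod
    def acquire_lock(self, state_id: str) -> None:
        """Take the lock for a state ID; timeout policy is per backend."""

    @abstractmethod
    def release_lock(self, state_id: str) -> None:
        """Release the lock for a state ID."""


class InMemoryStateStoreManager(StateStoreManager):
    """Toy backend for illustration only; nothing is persisted."""

    def __init__(self) -> None:
        self._states: dict[str, dict] = {}
        self._locks: set[str] = set()

    def get(self, state_id: str) -> dict:
        stored = self._states.get(state_id, {})
        return {"complete": stored.get("complete", {}),
                "partial": stored.get("partial", {})}

    def set(self, state_id: str, state: dict, complete: bool) -> None:
        slot = "complete" if complete else "partial"
        self._states.setdefault(state_id, {})[slot] = state

    def acquire_lock(self, state_id: str) -> None:
        if state_id in self._locks:
            raise RuntimeError(f"state {state_id!r} is already locked")
        self._locks.add(state_id)

    def release_lock(self, state_id: str) -> None:
        self._locks.discard(state_id)
```

A service layer written against the base class never needs to know which backend it is talking to, which is what would allow a DB-backed manager to be swapped for a URI-based one.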

Configuring

In this v1, the state backend will be configurable at the project level and will be a top-level key in meltano.yml. This is the simplest approach and meets most use cases that have been discussed as motivation for the state backends feature. However, the implementation outlined above will make it relatively easy to quickly update state backends to be configurable at any configuration layer--StateService can simply do SettingsService.get calls to determine what StateStoreManager to use and pass through any necessary config. Then it won't be much effort to implement a PluginStateStoreManager that can wrap Meltano StateBackend extensions when we're ready to add that capability to the EDK.
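
The selection step described above might look something like the sketch below. Everything here is an assumption for illustration: the `state_backend` key and its `uri` field are hypothetical names for the top-level meltano.yml setting, the config is shown as an already-parsed dict, and the two manager classes are empty placeholders.

```python
from urllib.parse import urlparse


class DBStateStoreManager:
    """Placeholder for the system DB backend named in the spec."""

    def __init__(self, uri: str) -> None:
        self.uri = uri


class S3StateStoreManager:
    """Hypothetical placeholder for a URI-configured S3 backend."""

    def __init__(self, uri: str) -> None:
        self.uri = uri


# Map from URI scheme (or bare backend name) to manager class.
MANAGERS = {"systemdb": DBStateStoreManager, "s3": S3StateStoreManager}
DEFAULT_URI = "systemdb"


def manager_for(project_config: dict):
    """Pick a state store manager from the parsed project config."""
    # ASSUMPTION: the key names "state_backend" and "uri" are made up here.
    uri = project_config.get("state_backend", {}).get("uri", DEFAULT_URI)
    # A bare name such as "systemdb" has no scheme, so fall back to the name.
    scheme = urlparse(uri).scheme or uri
    try:
        manager_cls = MANAGERS[scheme]
    except KeyError:
        raise ValueError(f"unsupported state backend: {uri!r}") from None
    return manager_cls(uri)
```

With no setting present, the lookup falls back to the system DB, so existing projects would keep their current behavior.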

FYI @aaronsteers @tayloramurphy @pandemicsyn

@tayloramurphy
Collaborator

@cjohnhanson appreciate the write-up and context setting 👍

@aaronsteers
Contributor Author

aaronsteers commented Sep 2, 2022

Since we're refactoring jobs and state as part of #3340, the PR for that issue will include a basic v1 for the state backend approach, but using systemdb as the first supported state backend.

👍 🎯

#3340's PR will have rewritten the way we manage state entirely to be basically the same pattern as we use for managing settings. We're rewriting StateService methods to be backend-agnostic and then we'll have a StateStore ABC that state backend implementations will inherit, very similar to how we currently have a unified ProjectSettingsService and PluginSettingsService that mostly wrap methods that are implemented at the SettingsStoreManager level.

👍

Since this is a significant refactor and major new feature, it's difficult to chunk up into small iterations.

Emphasis on 'small' iterations, yes?

Small or not, does this seem like the right approach to breaking this into chunks?:

Do I have this right @cjohnhanson ?

@aaronsteers changed the title from "feat: Introduce state backends such as s3, dynamodb, etc." to "feat: Add support for non-DB state backends (s3, dynamodb, etc.)" on Sep 2, 2022

@cjohnhanson
Contributor

@aaronsteers --

100%, that's exactly the plan.

Third PR. Add --from_backend and --to_backend support to meltano state copy|move CLI commands. (Not logged yet.)

Yeah, I think that makes sense as the next step and should be pretty straightforward to implement after these changes.

4 participants