Integrate RDBMS-via-cdc data into the data lake #7

Open
fabdy opened this issue Nov 3, 2022 · 0 comments
Labels
_initial-open-source-release: Stuff to do before open sourcing the repo

Comments

fabdy commented Nov 3, 2022

A very common pattern is probably a PG/MySQL database that holds the transactional data, which should be ingested into the data lake. For that we would like to have a PG DB that regularly receives data changes (generated by a lambda) and streams these changes into a raw-raw S3 location in the data lake (via DMS). From there we want to transform the changes into an event table in Parquet. It would also be nice to have a way to get the latest state per table (e.g. a query that uses the primary key and returns the latest row for each key).
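
To make the "latest row per primary key" idea concrete, here is a minimal sketch in plain Python, not a proposed implementation. It assumes each CDC event is a dict carrying the primary key, an ordering column, and an operation marker. The names `id`, `changed_at`, `op`, and the delete marker `"D"` are hypothetical and would have to match whatever the CDC export actually writes.

```python
from typing import Any, Dict, Hashable, Iterable, List

Event = Dict[str, Any]


def latest_rows(
    events: Iterable[Event],
    key: str = "id",               # assumed primary-key column
    order_by: str = "changed_at",  # assumed ordering column (e.g. commit timestamp)
    op_field: str = "op",          # assumed operation column
    delete_op: str = "D",          # assumed marker for deletes
) -> List[Event]:
    """Return the most recent event per primary key, dropping deleted rows."""
    latest: Dict[Hashable, Event] = {}
    for event in events:
        current = latest.get(event[key])
        # ">=" lets a later event in the input win a tie on the ordering column.
        if current is None or event[order_by] >= current[order_by]:
            latest[event[key]] = event
    # A key whose most recent event is a delete no longer exists in the source.
    return [row for row in latest.values() if row[op_field] != delete_op]


events = [
    {"id": 1, "name": "a", "changed_at": 1, "op": "I"},
    {"id": 2, "name": "b", "changed_at": 2, "op": "I"},
    {"id": 1, "name": "a2", "changed_at": 3, "op": "U"},
    {"id": 2, "name": "b", "changed_at": 4, "op": "D"},
]
current_state = latest_rows(events)
```

On the event table itself the same logic would more likely be a query that ranks rows per primary key by the ordering column and keeps the first one; the Python version only illustrates the semantics.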

DoD

  • We have a PG data source that delivers data via CDC into the "converted" location of the data lake

jankatins added the _initial-open-source-release label Nov 4, 2022