add cli command 'mara catalog connect' #3
base: main
Conversation
e991150 to c5290d1
@jankatins maybe you could throw a quick look at this. The final documentation is still missing, but from the description and the code it should be easy to get an idea of it. Just if you'd like.
This command is extremely helpful when setting up new environments or when you want to query your data lake with different query engines. PostgreSQL could be supported with e.g. parquet_fdw, but that is out of scope for me. Maybe someone else wants to implement it...
I tried to understand what this does and I am still a bit clueless here: if I understand the above correctly, it basically does some magic to discover datasets from files and sets up a pipeline to copy them over? If that's the case, I feel this is mixing a few concepts: in AWS Glue, a catalog is something which describes datasets/tables which are (mostly) Parquet files in S3 (so files are basically represented as a "DB"), so from "connect" I would expect that it can connect to that (already existing) AWS Glue catalog. I would also expect that "connect" does not perform any actions.
So all in all, I still do not get what this is about, because it doesn't adhere (or at least seems not to adhere) to the concepts I know from AWS Glue catalogs.
@jankatins First of all, what is a catalog?
It is not decided at this point what you want to do with this catalog information. Here are some ideas:
The catalog itself is, first of all, just a Python code representation of table metadata. What is this function? Just think of the dbt package dbt-external-tables: in my case above, I use the "crawler function" to auto-discover the tables from the storage. This is a nice feature since many databases do not support a crawler like AWS Glue does. Since there is not yet any caching, I put this into a single command. But the main feature of this command is to update the metadata of the db engine (e.g. by calling …).
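To make that concrete, here is a minimal, hypothetical sketch of a catalog as "just a Python code representation of table metadata", together with the kind of DDL such a command could emit. All class, field, and function names here are illustrative assumptions, not the actual mara-catalog API:

```python
from dataclasses import dataclass, field

# Hypothetical sketch: names and structure are assumptions for
# illustration, not the actual mara-catalog API.

@dataclass
class Table:
    name: str
    path: str            # location of the table's files, relative to the catalog
    format: str = 'PARQUET'

@dataclass
class Catalog:
    schema_name: str     # schema in the db engine to create the tables in
    base_path: str       # root of the data lake on the storage
    tables: list = field(default_factory=list)

def create_external_table_ddl(catalog: Catalog, table: Table) -> str:
    """Renders DDL that tells a db engine where the data is placed
    (generic pseudo-dialect; the exact syntax differs per engine)."""
    return (f"CREATE OR REPLACE EXTERNAL TABLE {catalog.schema_name}.{table.name}\n"
            f"LOCATION '{catalog.base_path}/{table.path}'\n"
            f"FILE FORMAT {table.format}")
```

A crawler would then populate `Catalog.tables` by listing the folders under `base_path`.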
A command which connects a data lake with a database engine by executing the required SQL commands to tell the db engine where the data is placed on the storage:
Currently supported database engines:
Example use case:
You have a data lake with the following Hadoop-like folder structure:
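For illustration, a hypothetical layout of this kind could look like the following (the schema, table, and partition names are assumptions):

```
data-lake/
  crm/
    customers/
      part-0000.parquet
      part-0001.parquet
    orders/
      part-0000.parquet
  web/
    page_views/
      day=2023-01-01/part-0000.parquet
      day=2023-01-02/part-0000.parquet
```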
Now you want to use the tables in a database engine. It would take you some time to create all the metadata in the database engine by hand. With the `mara catalog connect` command, you can create the required metadata objects so that you can run queries against the tables: define the catalog in your `mara_config.py`, then just run the command.
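For illustration, a hypothetical `mara_config.py` could look like the following. The `mara_catalog.config.catalogs` function, its patching via `mara_app.monkey_patch`, and the dictionary shape are assumptions, not the PR's confirmed API:

```python
# mara_config.py -- hypothetical sketch, not the confirmed mara-catalog API
import mara_catalog.config
from mara_app.monkey_patch import patch

@patch(mara_catalog.config.catalogs)
def catalogs():
    return {
        'data_lake': {
            'base_path': 's3://my-bucket/data-lake',  # storage root to crawl
            'schema_name': 'data_lake',               # target schema in the db engine
        },
    }
```

and then run:

```
mara catalog connect
```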
The command will then crawl the `base_path`, auto-discover the tables, and create the corresponding metadata in the database engine.
Currently, the command runs in a `create_or_replace` mode, which is typical for dbt. A typical mara mode would be `replace_schema`, which is planned.
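To illustrate the difference between the two modes, here is a rough sketch of the statements each one could translate to. The exact SQL depends on the db engine; this is an assumption, not the implemented behavior:

```python
def statements_for(mode: str, schema: str, tables: list) -> list:
    """Illustrative sketch of the two sync modes (an assumption,
    not the implemented behavior)."""
    if mode == 'create_or_replace':
        # dbt-style: replace each table individually, keep the schema
        return [f'CREATE OR REPLACE EXTERNAL TABLE {schema}.{table} ...'
                for table in tables]
    if mode == 'replace_schema':
        # mara-style: drop and recreate the whole schema, then all tables
        return ([f'DROP SCHEMA IF EXISTS {schema} CASCADE',
                 f'CREATE SCHEMA {schema}']
                + [f'CREATE EXTERNAL TABLE {schema}.{table} ...'
                   for table in tables])
    raise ValueError(f'unknown mode: {mode}')
```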