add cli command 'mara catalog connect' #3

Open

wants to merge 4 commits into base: main

Conversation

@leo-schick (Member) commented Nov 21, 2023

A command which connects a data lake with a database engine by executing the SQL commands required to tell the db engine where the data is located in storage:

(.venv) [add-catalog-connnect-command][~/git/mara/mara-catalog]$ mara catalog connect --help
Usage: mara catalog connect [OPTIONS]

  Connects a data lake or lakehouse catalog to a database

  Args:     catalog: The catalog to connect to. If not set, all configured
  catalogs will be connected.     db_alias: The db alias the catalog shall be
  connected to. If not set, the default db alias is taken.

      disable_colors: If true, don't use escape sequences to make the log
      colorful (default: colorful logging)

Options:
  --catalog TEXT    The catalog to connect. If not given, all catalogs will be
                    connected.
  --db-alias TEXT   The database the catalog(s) shall be connected to. If not
                    given, the default db alias is used.
  --disable-colors  Output logs without coloring them.
  --help            Show this message and exit.

Currently supported database engines:

  • SQL Server Polybase
  • Azure Synapse
  • Databricks
  • Snowflake

Example use case:

You have a data lake with the following Hadoop-like folder structure:

/<table_name>/<file_part1...partN>.[csv|tsv|json|parquet|orc|avro]  # typical Hadoop folder structure
/<schema_name>/<table_name>/<file_part1...partN>.[csv|tsv|json|parquet|orc|avro]   # folder structure with schema
/<table_name>/_delta_log    # a delta table (requires pypi package deltalake, not in the default requirements)

e.g.
/sales/invoices/part1.parquet
/sales/invoices/part2.parquet
/sales/invoices/part3.parquet
/sales/orders/part1.parquet
/sales/sales_datamart/part1.parquet
/sales/sales_datamart/part2.parquet
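
As a rough illustration of this convention (not mara-catalog's actual implementation; the function name and return shape below are made up for the sketch), discovering tables from such a layout could work along these lines:

from pathlib import Path

DATA_SUFFIXES = {'.csv', '.tsv', '.json', '.parquet', '.orc', '.avro'}

def discover_tables(base_path: str, has_schemas: bool = False) -> dict:
    """Map folders below base_path to (schema, table) entries.

    A folder counts as a table when it contains data files or a _delta_log
    subfolder. With has_schemas=True the first folder level is the schema name.
    """
    tables = {}
    base = Path(base_path)
    for folder in sorted(p for p in base.rglob('*') if p.is_dir()):
        if folder.name == '_delta_log':
            continue
        is_delta = (folder / '_delta_log').is_dir()
        has_data_files = any(f.suffix in DATA_SUFFIXES for f in folder.iterdir() if f.is_file())
        if not (is_delta or has_data_files):
            continue
        parts = folder.relative_to(base).parts
        schema = parts[0] if has_schemas and len(parts) > 1 else None
        tables[(schema, parts[-1])] = {'path': str(folder), 'format': 'delta' if is_delta else None}
    return tables

# e.g. discover_tables('/', has_schemas=True) would yield an entry ('sales', 'invoices')
# for the files under /sales/invoices/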

Now you want to use these tables from a database engine. Creating all the metadata in the database engine by hand takes time. The mara catalog connect command can create the required metadata objects for you so that you can run queries against the tables:

Define the catalog in your mara_config.py:

from mara_app.monkey_patch import patch

from mara_storage.storages import LocalStorage
import mara_storage.config

import mara_db.config
from mara_db.dbs import SqlcmdSQLServerDB

from mara_catalog.catalog import StorageCatalog
import mara_catalog.config

patch(mara_storage.config.storages)(lambda: {
    'datalake': LocalStorage(base_path='/')
})

patch(mara_db.config.databases)(lambda: {
    'query-engine': SqlcmdSQLServerDB(...),
})

patch(mara_catalog.config.catalogs)(lambda: {
    'sales_data': StorageCatalog(storage_alias='datalake', base_path='sales', default_schema='sales'),
    # or 
    #'sales_data': StorageCatalog(storage_alias='datalake', base_path='.', has_schemas=True),
})

Now you just need to run:

mara catalog connect --catalog sales_data --db-alias query-engine

The command will

  • discover all tables in your data lake base_path
  • if required, automatically detect the schema
  • create a pipeline with SQL commands which 1. create the schema if it does not exist and 2. tell the db engine where to find the tables and with which columns (if required) (see the sketch below for the kind of statements this generates)
  • run the pipeline
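
What those SQL commands look like depends on the target engine. As a hypothetical sketch (the helper function and the exact statement text are assumptions for illustration, here in Spark/Databricks-flavoured syntax, not necessarily what mara-catalog emits; Polybase, Synapse and Snowflake need different statements):

def external_table_ddl(schema: str, table: str, location: str, file_format: str = 'PARQUET') -> list:
    """Metadata-only DDL for one discovered table: no data is copied,
    the engine is only told where the files live."""
    return [
        f'CREATE SCHEMA IF NOT EXISTS {schema}',
        f"CREATE TABLE IF NOT EXISTS {schema}.{table} USING {file_format} LOCATION '{location}'",
    ]

# e.g. external_table_ddl('sales', 'invoices', '/sales/invoices')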

Currently, the command runs in a create_or_replace mode, which is typical for dbt. A typical mara mode would be replace_schema, which is planned.

@leo-schick (Member, Author) commented:

@jankatins maybe you could take a quick look at this. The final documentation is still missing, but from the description and code it should be easy to get an idea of what it does. Just if you'd like.

@leo-schick (Member, Author) commented:

This command is extremely helpful when setting up new environments or when you want to query your data lake with different query engines. PostgreSQL could be supported via e.g. parquet_fdw, but that is out of scope for me. Maybe someone else wants to implement it...

@jankatins (Member) commented Dec 7, 2023

I tried to understand what this does and I am still a bit clueless here: if I understand the above right, it basically does some magic to discover datasets from files and sets up a pipeline to copy them over?

If that's the case, I feel that this is mixing a few concepts: I know that in AWS Glue, a catalog is something which describes datasets/tables which are (mostly) parquet files in S3 (so basically files are represented as a "DB"), so from "connect" I would expect that it connects to that (already existing) AWS Glue catalog. I would also expect that "connect" does not perform any actions.

  • create catalog (creates an empty catalog)
  • catalog: discover from filesystem/... (basically the counterpart to an AWS Glue crawler)
  • connect to (external) catalog (to catalogs which exist in other systems, e.g. AWS Glue) -> loads the metadata and makes all tables available via the catalog interface, which would be kind of a DB?
  • set up forwarders in one DB to the catalog tables?

This assumes that "a pipeline with SQL commands to 1. create the schema if it does not exist and then 2. run SQL commands to tell the db engine where to find the tables and with which columns (if required)" is only about metadata and not about real data getting copied from somewhere to somewhere else. I would find it strange to have this kind of thing as a pipeline (which I can (or cannot?) see in the UI?) and not as a simple task or even just as a cli command.

So all in all, I still do not get what this is about because it doesn't adhere (or at least seems to not adhere) to the concepts I know from AWS glue catalogs.

@leo-schick (Member, Author) commented:

@jankatins
The name connect is indeed confusing, but I have not yet come up with another name for this function. And yes, you seem to be a bit confused :-)

First of all, what is a catalog?
A catalog is a mara representation of table (or "model") metadata, currently without columns. This could be:

  1. a StorageCatalog (see above) - represents the tables from a data lake or lakehouse. (Note that auto-discovery of tables is optional. You could decide to specify all tables you want to have in the catalog in code, like you have to specify all entities in the mara-schema package.)
  2. a DbCatalog - representing table metadata in a database
  3. a DbtDbCatalog/DbtStorageCatalog - representing the available models in a dbt project, taken from a manifest file

It is not decided at this point what you want to do with this catalog information. Here are some ideas:

  • show the catalog in the UI
  • cache/load a "discovered" catalog from disk, e.g. in a JSON file
  • a dbt-fashioned mara catalog run -s <catalog_name.model> command which can build your models (would require dependency management + a sub-pipeline per model).

The catalog itself is - first of all - just a python code representation of table metadata.
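
To make that concrete, here is a purely hypothetical sketch of what such an in-code representation could look like (the class and field names are made up for illustration and are not mara_catalog's actual API):

from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class TableMetadata:
    # made-up illustration of "table metadata without columns"
    name: str
    schema: Optional[str] = None
    location: Optional[str] = None    # e.g. a path on the storage for a StorageCatalog
    file_format: Optional[str] = None  # e.g. 'parquet', 'delta'

@dataclass
class Catalog:
    name: str
    tables: List[TableMetadata] = field(default_factory=list)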

What is this function?
It is a helper function which informs a database system about where to find the models (so: creating/updating the metadata in the db engine). It is currently developed only for storage catalogs.

Just think of the dbt package dbt-external-tables: there you specify in yml files what a source table looks like. Then you have the option to call dbt run-operation stage_external_sources to create these tables in the database engine so that you can use them.

In my case above, I use the "crawler function" to auto-discover the tables from the storage. This is a nice feature since many database engines do not ship a crawler like AWS Glue does. Since there is not yet any caching, I put this into a single command. But the main feature of this command is to update the metadata of the db engine (e.g. calling CREATE OR REPLACE EXTERNAL TABLE). The table discovery is just a feature on top provided by this mara module.
