# Notebook Description

This notebook onboards external database tables into Databricks. It selects tables for onboarding, retrieves their metadata, prepares JDBC connection properties for several RDBMS types, and loads the tables into Delta tables.

## Workflow Overview

1. **Parameter Input via Widgets**:  
   Users specify the target table IDs, run ID (optional), and schema using Databricks widgets.

2. **Metadata Management**:  
   The notebook loads metadata from three tables: `connection_metadata`, `table_metadata`, and `column_metadata`.  
   - `connection_metadata`: Contains connection details for various databases.
   - `table_metadata`: Contains metadata for tables, including onboarding status.
   - `column_metadata`: Contains column-level metadata.

3. **Table Selection**:  
   The notebook keeps only tables whose `table_id` was requested and whose `onboarded_flag` is not `N`. A null `onboarded_flag` is treated as `Y`.

4. **Connection Metadata Retrieval**:  
   For each selected table, the corresponding connection metadata is retrieved.

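5. **Client Library Installation**:  
   The notebook installs the Python client package for the detected RDBMS type using `%pip`. The JDBC reads themselves rely on the matching JDBC driver being available on the cluster.

6. **JDBC URL and Properties Construction**:  
   The notebook builds the JDBC URL and connection properties for the target database. Supported RDBMS types: Oracle, SQL Server, PostgreSQL, MySQL, MariaDB, Snowflake, BigQuery, Redshift, DB2, and SAP HANA.

7. **Load and Run Logging**:  
   Each selected table is read over JDBC and written to a Delta table (`overwrite` for `full` tables, `append` for `incremental` tables). Every write is logged to `workspace.default.etl_run_logs` with an `INPROGRESS` status followed by `COMPLETED` or `ERROR`.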

## Key Features

- **Dynamic RDBMS Support**:  
  The notebook supports ten RDBMS platforms, selecting the client package and the JDBC connection string format from the connection's `type`.

- **Metadata-Driven**:  
  All operations are driven by metadata tables, enabling flexible and scalable onboarding workflows.

- **Widget-Driven Parameterization**:  
  Users can easily control the onboarding process through Databricks widgets.

## Usage

1. Set the `table_ids` widget to a comma-separated list of table IDs to onboard.
2. Optionally set the `run_id` widget, or leave blank to auto-generate.
3. Set the `schema` widget as needed. The value is read into `schema` but is not used by the cells in this notebook.
4. Run the notebook cells sequentially.

This notebook is intended for data engineers and platform administrators responsible for onboarding and managing external data sources in Databricks.

In [0]:
from pyspark.sql import SparkSession, DataFrame
import uuid

In [0]:
# Databricks widgets for table_ids, run_id, and schema
dbutils.widgets.text("table_ids", "", "Table IDs (comma-separated)")
dbutils.widgets.text("run_id", "", "Run ID (leave blank to auto-generate)")
dbutils.widgets.text("schema", "", "Schema")
table_ids_param = dbutils.widgets.get("table_ids")
run_id_param = dbutils.widgets.get("run_id")
table_ids = [tid.strip() for tid in table_ids_param.split(",") if tid.strip()]
schema = dbutils.widgets.get("schema")
run_id = run_id_param if run_id_param else str(uuid.uuid4())

In [0]:
class MetadataManager:
    def __init__(self, spark: SparkSession):
        self.spark = spark

    def load_connection_metadata(self) -> DataFrame:
        return self.spark.table('connection_metadata')

    def load_table_metadata(self) -> DataFrame:
        return self.spark.table('table_metadata')

In [0]:
spark = SparkSession.builder.getOrCreate()
metadata = MetadataManager(spark)

In [0]:
# Load metadata tables
table_meta = metadata.load_table_metadata()
conn_meta = metadata.load_connection_metadata()
col_meta = metadata.spark.table('column_metadata')


In [0]:
from pyspark.sql.functions import when, col

selected_tables = table_meta.withColumn(
    "onboarded_flag",
    when(col("onboarded_flag").isNull(), "Y").otherwise(col("onboarded_flag"))
).filter(
    (col("table_id").isin(table_ids)) & (col("onboarded_flag") != 'N')
).collect()
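
The selection rule in the cell above can be restated without Spark, which makes the null handling easier to see. This is a minimal sketch; `is_eligible` is a hypothetical helper that does not exist in the notebook.

```python
def is_eligible(table_id, onboarded_flag, requested_ids):
    """Mirror the Spark filter: requested ID and flag not 'N' (null counts as 'Y')."""
    flag = "Y" if onboarded_flag is None else onboarded_flag
    return table_id in requested_ids and flag != "N"


requested = ["conn_postgres_chinook_Customer"]
print(is_eligible("conn_postgres_chinook_Customer", None, requested))
print(is_eligible("conn_postgres_chinook_Customer", "N", requested))
```

A table with a null flag is therefore selected, while one explicitly marked `N` is skipped even if its ID was requested.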

In [0]:
print(selected_tables)

[Row(table_id='conn_postgres_chinook_Customer', connection_id='conn_postgres_chinook', table_name='Customer', target_table_name='stg_Customer', table_type='full', primary_key_columns='None', watermark_column='None', partition_column='None', target_path='/mnt/datalake/stg_Customer', load_frequency='daily', active_flag='Y', comments='Auto-populated', optimize_zorder_by='None', repartition_columns='None', num_output_files='None', write_mode='overwrite', cache_intermediate='False', target_db='default', onboarded_flag='Y', table_call_name='"public"."Customer"')]


In [0]:
for row in selected_tables:
    conn = conn_meta.filter(conn_meta.connection_id == row.connection_id).collect()[0]
    print(conn)

Row(connection_id='conn_postgres_chinook', type='postgresql', host='ep-sweet-snow-aeztchbb-pooler.c-2.us-east-2.aws.neon.tech', port=5432, database='chinook', schema='public', username='neondb_owner', password='[REDACTED]', options='')


In [0]:
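# Note: these are Python client packages. spark.read.jdbc (used below) needs the
# matching JDBC driver on the cluster classpath; %pip does not install it.
if conn.type == "oracle":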
    %pip install cx_Oracle
elif conn.type == "sqlserver":
    %pip install pyodbc
elif conn.type == "postgresql":
    %pip install psycopg2-binary
elif conn.type == "mysql":
    %pip install mysql-connector-python
elif conn.type == "mariadb":
    %pip install mariadb
elif conn.type == "snowflake":
    %pip install snowflake-connector-python
elif conn.type == "bigquery":
    %pip install google-cloud-bigquery
elif conn.type == "redshift":
    %pip install redshift-connector
elif conn.type == "db2":
    %pip install ibm-db
elif conn.type == "hana":
    %pip install hdbcli
else:
    raise Exception(f"Unsupported RDBMS type: {conn.type}")

Note: you may need to restart the kernel using %restart_python or dbutils.library.restartPython() to use updated packages.


In [0]:
jdbc_url = None
connection_properties = {}

if conn.type == "oracle":
    jdbc_url = f"jdbc:oracle:thin:@{conn.host}:{conn.port}/{conn.database}"
    connection_properties = {
        "user": conn.username,
        "password": conn.password,
        "driver": "oracle.jdbc.OracleDriver",
        "currentSchema": conn.schema
    }
elif conn.type == "sqlserver":
    jdbc_url = f"jdbc:sqlserver://{conn.host}:{conn.port};databaseName={conn.database};schema={conn.schema}"
    connection_properties = {
        "user": conn.username,
        "password": conn.password,
        "driver": "com.microsoft.sqlserver.jdbc.SQLServerDriver",
        "schema": conn.schema
    }
elif conn.type == "postgresql":
    jdbc_url = f"jdbc:postgresql://{conn.host}:{conn.port}/{conn.database}"
    connection_properties = {
        "user": conn.username,
        "password": conn.password,
        "driver": "org.postgresql.Driver",
        "currentSchema": conn.schema
    }
elif conn.type == "mysql":
    jdbc_url = f"jdbc:mysql://{conn.host}:{conn.port}/{conn.database}"
    connection_properties = {
        "user": conn.username,
        "password": conn.password,
        "driver": "com.mysql.cj.jdbc.Driver",
        "schema": conn.schema
    }
elif conn.type == "mariadb":
    jdbc_url = f"jdbc:mariadb://{conn.host}:{conn.port}/{conn.database}"
    connection_properties = {
        "user": conn.username,
        "password": conn.password,
        "driver": "org.mariadb.jdbc.Driver",
        "schema": conn.schema
    }
elif conn.type == "snowflake":
    jdbc_url = f"jdbc:snowflake://{conn.host}/?db={conn.database}&schema={conn.schema}"
    connection_properties = {
        "user": conn.username,
        "password": conn.password,
        "driver": "net.snowflake.client.jdbc.SnowflakeDriver",
        "schema": conn.schema
    }
elif conn.type == "bigquery":
    jdbc_url = f"jdbc:bigquery://https://www.googleapis.com/bigquery/v2:443;ProjectId={conn.database};DefaultDataset={conn.schema};"
    connection_properties = {
        "user": conn.username,
        "password": conn.password,
        "driver": "com.simba.googlebigquery.jdbc42.Driver",
        "DefaultDataset": conn.schema
    }
elif conn.type == "redshift":
    jdbc_url = f"jdbc:redshift://{conn.host}:{conn.port}/{conn.database}"
    connection_properties = {
        "user": conn.username,
        "password": conn.password,
        "driver": "com.amazon.redshift.jdbc.Driver",
        "currentSchema": conn.schema
    }
elif conn.type == "db2":
    jdbc_url = f"jdbc:db2://{conn.host}:{conn.port}/{conn.database}"
    connection_properties = {
        "user": conn.username,
        "password": conn.password,
        "driver": "com.ibm.db2.jcc.DB2Driver",
        "currentSchema": conn.schema
    }
elif conn.type == "hana":
    jdbc_url = f"jdbc:sap://{conn.host}:{conn.port}"
    connection_properties = {
        "user": conn.username,
        "password": conn.password,
        "driver": "com.sap.db.jdbc.Driver",
        "currentSchema": conn.schema
    }
else:
    raise Exception(f"Unsupported RDBMS type: {conn.type}")
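
The if/elif chain above could also be written as a lookup table, so that the URL and properties can be built per connection rather than once for the global `conn`. This is a minimal sketch: `JDBC_TEMPLATES` and `build_jdbc_config` are hypothetical names, the templates and driver classes are copied from the chain above, and `SimpleNamespace` only stands in for a `connection_metadata` row.

```python
from types import SimpleNamespace

# type -> (URL template, driver class, name of the schema property),
# copied from the if/elif chain in the notebook.
JDBC_TEMPLATES = {
    "oracle": ("jdbc:oracle:thin:@{host}:{port}/{database}", "oracle.jdbc.OracleDriver", "currentSchema"),
    "sqlserver": ("jdbc:sqlserver://{host}:{port};databaseName={database};schema={schema}", "com.microsoft.sqlserver.jdbc.SQLServerDriver", "schema"),
    "postgresql": ("jdbc:postgresql://{host}:{port}/{database}", "org.postgresql.Driver", "currentSchema"),
    "mysql": ("jdbc:mysql://{host}:{port}/{database}", "com.mysql.cj.jdbc.Driver", "schema"),
    "mariadb": ("jdbc:mariadb://{host}:{port}/{database}", "org.mariadb.jdbc.Driver", "schema"),
    "snowflake": ("jdbc:snowflake://{host}/?db={database}&schema={schema}", "net.snowflake.client.jdbc.SnowflakeDriver", "schema"),
    "bigquery": ("jdbc:bigquery://https://www.googleapis.com/bigquery/v2:443;ProjectId={database};DefaultDataset={schema};", "com.simba.googlebigquery.jdbc42.Driver", "DefaultDataset"),
    "redshift": ("jdbc:redshift://{host}:{port}/{database}", "com.amazon.redshift.jdbc.Driver", "currentSchema"),
    "db2": ("jdbc:db2://{host}:{port}/{database}", "com.ibm.db2.jcc.DB2Driver", "currentSchema"),
    "hana": ("jdbc:sap://{host}:{port}", "com.sap.db.jdbc.Driver", "currentSchema"),
}


def build_jdbc_config(conn):
    """Return (jdbc_url, connection_properties) for one connection row."""
    try:
        url_template, driver, schema_key = JDBC_TEMPLATES[conn.type]
    except KeyError:
        raise ValueError(f"Unsupported RDBMS type: {conn.type}") from None
    url = url_template.format(
        host=conn.host, port=conn.port, database=conn.database, schema=conn.schema
    )
    properties = {
        "user": conn.username,
        "password": conn.password,
        "driver": driver,
        schema_key: conn.schema,
    }
    return url, properties


# Placeholder values for illustration only.
example_conn = SimpleNamespace(
    type="postgresql", host="db.example.com", port=5432,
    database="chinook", schema="public", username="user", password="secret",
)
example_url, example_props = build_jdbc_config(example_conn)
print(example_url)
```

Because the function takes the connection as an argument, it could be called inside the per-table loop, which would avoid reusing one connection's settings for tables that belong to another.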

In [0]:
from datetime import datetime

LOG_COLUMNS = ["transaction_id", "run_id", "connection_id", "table_id", "table_name", "dateandtime", "mode", "status", "error", "target_table", "number_of_rows", "other_comments"]


def write_to_delta(df, mode, target_table, run_id, connection_id, table_id, table_name):
    log_table = "workspace.default.etl_run_logs"
    transaction_id = str(uuid.uuid4())

    def log_status(status, error="None", number_of_rows=0):
        """Append one status row for this transaction to the run-log table."""
        now = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
        record = [(transaction_id, run_id, connection_id, table_id, table_name, now, mode, status, error, target_table, number_of_rows, "")]
        spark.createDataFrame(record, LOG_COLUMNS).write.format("delta").mode("append").saveAsTable(log_table)

    # Log INPROGRESS before touching the target table
    log_status("INPROGRESS")

    number_of_rows = 0
    try:
        df.write.format("delta").mode(mode).saveAsTable(target_table)
        # Note: count() re-evaluates df, so the source is read a second time
        number_of_rows = df.count()
        log_status("COMPLETED", number_of_rows=number_of_rows)
    except Exception as e:
        # The failure is recorded in the log table and not re-raised
        log_status("ERROR", error=str(e), number_of_rows=number_of_rows)

In [0]:
def apply_col_mapping(df, col_map):
    """Rename source columns to their target names where the two differ."""
    for src, tgt in col_map.items():
        if src != tgt:
            df = df.withColumnRenamed(src, tgt)
    return df


# jdbc_url and connection_properties were built above from the last `conn`,
# so every selected table is assumed to use that same connection.
for row in selected_tables:
    tbl = row.table_call_name
    target_table = f"{row.target_db}.{row.target_table_name}" if row.target_db else row.target_table_name
    columns = col_meta.filter(col_meta.table_id == row.table_id).collect()
    # Fall back to the source column name when no target_column_name is set.
    col_map = {
        c.column_name: getattr(c, "target_column_name", None) or c.column_name
        for c in columns
    }
    if row.table_type == "full":
        df = spark.read.jdbc(
            url=jdbc_url,
            table=tbl,
            properties=connection_properties
        )
        df = apply_col_mapping(df, col_map)
        write_to_delta(df, "overwrite", target_table, run_id, row.connection_id, row.table_id, row.table_name)
    elif row.table_type == "incremental":
        try:
            # Highest watermark value already loaded into the target table.
            target_df = spark.table(target_table)
            max_watermark = target_df.agg({row.watermark_column: "max"}).collect()[0][0]
        except Exception:
            # Target table missing or unreadable (for example, first run): load everything.
            max_watermark = None
        # Filter at the source through the `predicates` argument of spark.read.jdbc;
        # a "predicate" key in the connection properties is not a Spark JDBC option.
        predicates = (
            [f"{row.watermark_column} > '{max_watermark}'"]
            if max_watermark is not None
            else None
        )
        df = spark.read.jdbc(
            url=jdbc_url,
            table=tbl,
            predicates=predicates,
            properties=connection_properties
        )
        df = apply_col_mapping(df, col_map)
        write_to_delta(df, "append", target_table, run_id, row.connection_id, row.table_id, row.table_name)
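
The incremental branch depends on turning the stored watermark into a WHERE-clause fragment. The sketch below isolates that step so it can be checked without a database. `build_watermark_predicate` is a hypothetical helper, and treating the string `'None'` as "no watermark column" is an assumption based on the sample `table_metadata` row shown earlier.

```python
def build_watermark_predicate(watermark_column, max_watermark):
    """Return a WHERE-clause fragment, or None when a full read is needed."""
    # Assumption: the metadata stores a missing column as None or the string 'None'.
    if not watermark_column or watermark_column == "None":
        return None
    if max_watermark is None:
        return None
    # Double any single quotes so the value stays inside the SQL string literal.
    escaped = str(max_watermark).replace("'", "''")
    return f"{watermark_column} > '{escaped}'"


print(build_watermark_predicate("updated_at", "2024-01-01 00:00:00"))
print(build_watermark_predicate("updated_at", None))
```

The result, when not `None`, can be wrapped in a one-element list and passed as the `predicates` argument of `spark.read.jdbc`. Checking `max_watermark is None` rather than its truthiness keeps a legitimate watermark of `0` from triggering a full reload.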