This utility provides a reusable, configurable framework for managing data connections and orchestrating data ingestion workflows across multiple sources and targets. Databricks widgets let users specify connection parameters, authentication details, and target destinations at run time without modifying the underlying code. The framework connects to several relational databases, writes to Databricks tables or cloud storage (abfss, s3, dbfs), and supports configurable schema and table naming.

Key achievements and features:
- **Parameterization**: All critical connection and ingestion parameters are exposed as widgets, allowing users to easily adapt the workflow to new sources, targets, and environments.
- **Modularity**: The design supports both single-table and multi-table ingestion scenarios, with options for schema, table, and path customization.
- **Security**: Usernames and passwords are supplied through widgets at run time rather than hardcoded in the notebook, which limits their exposure in source code.
- **Cloud Agnostic**: The framework supports ingestion to Databricks, Azure Data Lake (abfss), Amazon S3, and DBFS, making it suitable for hybrid and multi-cloud architectures.
- **Reusability**: The utility can be reused across projects and teams by simply updating widget values, reducing development time and promoting best practices.
- **Automation Ready**: With support for cron expressions and parameter-driven execution, the solution is ready for scheduling and automation in production pipelines.

This utility empowers data engineers and analysts to rapidly onboard new data sources, standardize ingestion processes, and accelerate data-driven initiatives with minimal code changes and maximum flexibility.


**Parameter Descriptions**

The following parameters are used to configure the connection and ingestion process:

- connection_id: Unique identifier for the connection configuration.
- type: Type of the source database (e.g., postgres, mysql).
- host: Hostname or IP address of the source database server.
- port: Port number for the source database connection (integer).
- database: Name of the source database.
- schema: Schema name within the source database.
- username: Username for authenticating to the source database.
- password: Password for authenticating to the source database.
- options: Additional connection options in key-value format (optional).
- table_id: Specific table ID to process (optional, for single table operations).
- cron_expression: Cron expression for scheduling ingestion jobs (optional).
- schema_name: Name of the schema to be ingested or processed.
- target_db: Target database in Databricks, abfss, s3, or dbfs where data will be stored.
- target_prefix: Prefix to add to target table names (optional).
- target_suffix: Suffix to add to target table names (optional).
- target_path_type: Type of target path (optional, e.g., managed, external, abfss, s3, dbfs).
- target_container: Target container name for abfss storage (required for abfss).
- target_account: Target account name for abfss storage (required for abfss).
- target_bucket: Target bucket name for s3 storage (required for s3).
- target_mount: Target mount point for dbfs storage (required for dbfs).
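
The storage-specific requirements in the list above can be checked before any ingestion starts. The following is a minimal sketch rather than the notebook's own code: `missing_target_params` and `REQUIRED_BY_PATH_TYPE` are hypothetical names, and the rules are taken only from the "required for ..." notes in the parameter list.

```python
# Hypothetical helper: reports which storage-specific parameters are still empty.
# The rules mirror the "required for ..." notes in the parameter list.
REQUIRED_BY_PATH_TYPE = {
    "abfss": ("target_container", "target_account"),
    "s3": ("target_bucket",),
    "dbfs": ("target_mount",),
}


def missing_target_params(target_path_type, params):
    """Return the names of required parameters that are empty or absent."""
    kind = (target_path_type or "").strip().lower()
    required = REQUIRED_BY_PATH_TYPE.get(kind, ())
    return [name for name in required if not (params.get(name) or "").strip()]


# Widgets return an empty string when left blank, so "" counts as missing.
print(missing_target_params("abfss", {"target_container": "raw", "target_account": ""}))
print(missing_target_params("s3", {"target_bucket": "my-bucket"}))
```

Path types without an entry in the table (for example managed or external) are treated as having no extra requirements, which is an assumption of this sketch.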

**Target Path Usage**

- Default (local mount or anything else): The system will default to a general path under /mnt/datalake.
  - You must provide:
    - tgt_tbl_name: The target table or file name.
    - (Optional) target_db: Used as a folder inside /mnt/datalake.
  - Resulting path example: /mnt/datalake/<target_db>/<tgt_tbl_name>

- abfss (Azure Data Lake Gen2): Used for writing to Azure Storage accounts with hierarchical namespaces.
  - You must provide:
    - target_container: The container name inside your Azure storage account.
    - target_account: The Azure storage account name.
    - tgt_tbl_name: The target table or file name.
    - (Optional) target_db: Used as a folder inside the container.
  - Resulting path example: abfss://<target_container>@<target_account>.dfs.core.windows.net/<target_db>/<tgt_tbl_name>

- s3 (Amazon S3): Used for writing to AWS S3 buckets.
  - You must provide:
    - target_bucket: Your S3 bucket name.
    - tgt_tbl_name: The target table or file name.
    - (Optional) target_db: Used as a folder in the bucket.
  - Resulting path example: s3://<target_bucket>/<target_db>/<tgt_tbl_name>

- dbfs (Databricks File System): Used to write files under a mounted path in Databricks File System.
  - You must provide:
    - target_mount: The DBFS mount point.
    - tgt_tbl_name: The target table or file name.
    - (Optional) target_db: Used as a folder inside the mount.
  - Resulting path example: /dbfs/mnt/<target_mount>/<target_db>/<tgt_tbl_name>
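
The four path layouts above can be expressed as one small function. This is a sketch under stated assumptions, not the code used by the ingestion notebooks: `build_target_path` is a hypothetical name, and skipping an empty `target_db` (instead of leaving an empty folder segment) is an assumption about how the optional value is handled.

```python
def build_target_path(tgt_tbl_name, target_path_type="", target_db="",
                      target_container="", target_account="",
                      target_bucket="", target_mount=""):
    """Assemble a target path following the four cases described above."""
    kind = (target_path_type or "").strip().lower()
    if kind == "abfss":
        root = f"abfss://{target_container}@{target_account}.dfs.core.windows.net"
    elif kind == "s3":
        root = f"s3://{target_bucket}"
    elif kind == "dbfs":
        root = f"/dbfs/mnt/{target_mount}"
    else:
        # Default case: anything else falls back to the general data lake mount.
        root = "/mnt/datalake"
    # target_db is optional; it is skipped when empty (assumption of this sketch).
    parts = [root] + [part for part in (target_db, tgt_tbl_name) if part]
    return "/".join(parts)


print(build_target_path("orders", "s3", "sales", target_bucket="my-bucket"))
print(build_target_path("orders"))
```

The example values (`orders`, `sales`, `my-bucket`) are placeholders chosen for illustration.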

**Notebook Overview**
This notebook is designed to manage and orchestrate metadata and ingestion workflows for various data connections in a structured and automated way. It performs the following key functions:

**Widget Setup**
Captures input parameters such as connection type, host, port, database, schema, username, password, optional settings, and a cron expression using Databricks widgets.

**Parameter Retrieval**
Reads the values provided via widgets for use in the logic that follows.

**Connection Metadata Management**

Checks for existing connection metadata in the workspace.default.connection_metadata table.

If found, it loads the existing parameters.

If not, it inserts the new connection metadata into the table.

**Metadata Table Initialization**
Calls a supporting notebook to ensure that required metadata tables are created.

**Table Metadata Population**
If no table metadata is available for the connection, it triggers a process to populate it.

**Column Metadata Aggregation**
Fetches and summarizes table and column metadata, calculating and displaying the number of columns per table.

**Scheduled Ingestion Setup**
If a cron expression is provided, sets up a scheduled ingestion job for the specified connection.

**Immediate Ingestion Trigger**
Initiates an immediate data ingestion run for the connection.

The notebook includes conditional logic to prevent redundant metadata creation or updates, and allows ingestion to be triggered based on user-defined options (either scheduled or immediate).
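
The conditional flow described in this overview can be summarized as a small planning function. It is only an illustrative sketch: `plan_steps` and the step labels are hypothetical names, and the function returns labels instead of running any notebook.

```python
def plan_steps(connection_exists, has_table_metadata, cron_expression):
    """Return the ordered steps the notebook would take for the given state."""
    steps = []
    # Connection metadata is loaded when present and inserted otherwise.
    steps.append("load_connection_metadata" if connection_exists
                 else "insert_connection_metadata")
    steps.append("create_metadata_tables")
    # Table metadata is populated only when none exists for the connection.
    if not has_table_metadata:
        steps.append("populate_table_metadata")
    steps.append("summarize_column_counts")
    # A scheduled job is set up only when a cron expression was provided.
    if cron_expression:
        steps.append("schedule_ingestion")
    steps.append("run_immediate_ingestion")
    return steps


for step in plan_steps(connection_exists=False, has_table_metadata=False,
                       cron_expression=""):
    print(step)
```

Writing the flow this way makes it easy to see which steps are skipped on a second run for the same connection.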







# Database driver requirements for common databases:

**MySQL:**

- Install JDBC driver JAR (e.g., mysql-connector-java-8.0.33.jar)
- %pip install mysql-connector-python

**MariaDB:**

- Install JDBC driver JAR (e.g., mariadb-java-client-3.3.2.jar)
- %pip install mariadb

**Oracle:**

- Install JDBC driver JAR (e.g., ojdbc8.jar)
- %pip install cx_Oracle

**SQL Server:**

- Install JDBC driver JAR (e.g., mssql-jdbc-12.4.2.jre8.jar)
- %pip install pyodbc

**PostgreSQL:**

- %pip install psycopg2-binary

**Snowflake:**

- %pip install snowflake-connector-python

**BigQuery:**

- %pip install google-cloud-bigquery

**Redshift:**

- %pip install redshift-connector

**IBM DB2:**

- %pip install ibm-db

**SAP HANA:**

- %pip install hdbcli
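
The requirements listed above can also be kept in a lookup table, which avoids a long chain of branches. This is a hedged sketch: `DRIVER_REQUIREMENTS` and `describe_driver_requirements` are hypothetical names, the keys follow the type values checked by this notebook's `print_driver_requirements` function, and the function only returns a message without installing anything.

```python
# Lookup table built from the list above: the pip package for each database
# type, plus an example JDBC JAR where the list names one.
DRIVER_REQUIREMENTS = {
    "mysql": {"pip": "mysql-connector-python", "jar": "mysql-connector-java-8.0.33.jar"},
    "mariadb": {"pip": "mariadb", "jar": "mariadb-java-client-3.3.2.jar"},
    "oracle": {"pip": "cx_Oracle", "jar": "ojdbc8.jar"},
    "sqlserver": {"pip": "pyodbc", "jar": "mssql-jdbc-12.4.2.jre8.jar"},
    "postgresql": {"pip": "psycopg2-binary", "jar": None},
    "snowflake": {"pip": "snowflake-connector-python", "jar": None},
    "bigquery": {"pip": "google-cloud-bigquery", "jar": None},
    "redshift": {"pip": "redshift-connector", "jar": None},
    "db2": {"pip": "ibm-db", "jar": None},
    "hana": {"pip": "hdbcli", "jar": None},
}


def describe_driver_requirements(db_type):
    """Return a one-line description of what must be installed for db_type."""
    key = (db_type or "").strip().lower()
    requirement = DRIVER_REQUIREMENTS.get(key)
    if requirement is None:
        return f"Unsupported RDBMS type: {key}"
    message = f"%pip install {requirement['pip']}"
    if requirement["jar"]:
        message = f"JDBC driver JAR (e.g., {requirement['jar']}) and {message}"
    return message


print(describe_driver_requirements("MySQL"))
print(describe_driver_requirements("postgresql"))
```

Adding a new database type then means adding one dictionary entry rather than another branch.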

In [0]:
from pyspark.sql.functions import col
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

In [0]:
dbutils.widgets.text("connection_id", "", "Connection ID")
dbutils.widgets.text("type", "")
dbutils.widgets.text("host", "")
dbutils.widgets.text("port", "", "Port (integer)")
dbutils.widgets.text("database", "")
dbutils.widgets.text("schema", "")
dbutils.widgets.text("username", "")
dbutils.widgets.text("password", "")
dbutils.widgets.text("options", "")
dbutils.widgets.text("table_id", "", "Table ID (optional, for single table)")
dbutils.widgets.text("cron_expression", "")

dbutils.widgets.text("schema_name", "", "Schema Name")
dbutils.widgets.text("target_db", "", "Target Databricks,abfss,s3,dbfs Database")
dbutils.widgets.text("target_prefix", "", "Target Table Prefix (optional)")
dbutils.widgets.text("target_suffix", "", "Target Table Suffix (optional)")
dbutils.widgets.text("target_path_type", "", "Target Path Type (optional)")
dbutils.widgets.text("target_container", "", "Target Container (for abfss)")
dbutils.widgets.text("target_account", "", "Target Account (for abfss)")
dbutils.widgets.text("target_bucket", "", "Target Bucket (for s3)")
dbutils.widgets.text("target_mount", "", "Target Mount (for dbfs)")

In [0]:
connection_id = dbutils.widgets.get("connection_id")
type_ = dbutils.widgets.get("type")
host = dbutils.widgets.get("host")
port_str = dbutils.widgets.get("port")
port = int(port_str) if port_str else 5432
database = dbutils.widgets.get("database")
schema = dbutils.widgets.get("schema")
username = dbutils.widgets.get("username")
password = dbutils.widgets.get("password")
options = dbutils.widgets.get("options")
cron_expression = dbutils.widgets.get("cron_expression")

schema_name = dbutils.widgets.get("schema_name")
target_db = dbutils.widgets.get("target_db")
target_prefix = dbutils.widgets.get("target_prefix")
target_suffix = dbutils.widgets.get("target_suffix")
target_path_type = dbutils.widgets.get("target_path_type")
target_container = dbutils.widgets.get("target_container")
target_account = dbutils.widgets.get("target_account")
target_bucket = dbutils.widgets.get("target_bucket")
target_mount = dbutils.widgets.get("target_mount")
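
The cell above falls back to port 5432 whenever the port widget is empty, whatever the database type. One possible refinement, shown here only as a sketch, is to pick the fallback from the database type and to reject values that are not valid ports. `resolve_port` and `DEFAULT_PORTS` are hypothetical names, and the table lists only the commonly documented default ports of a few engines.

```python
# Commonly used default ports; types not listed keep the notebook's 5432 fallback.
DEFAULT_PORTS = {
    "mysql": 3306,
    "mariadb": 3306,
    "postgresql": 5432,
    "sqlserver": 1433,
    "oracle": 1521,
    "redshift": 5439,
}


def resolve_port(port_str, db_type, fallback=5432):
    """Convert the widget text to a port number, using a per-type default when empty."""
    text = (port_str or "").strip()
    if not text:
        return DEFAULT_PORTS.get((db_type or "").strip().lower(), fallback)
    if not text.isdigit():
        raise ValueError(f"port must be an integer, got {port_str!r}")
    port = int(text)
    if not 1 <= port <= 65535:
        raise ValueError(f"port out of range: {port}")
    return port


print(resolve_port("", "mysql"))
print(resolve_port("1433", "sqlserver"))
```

In the notebook this would replace the single `int(port_str) if port_str else 5432` expression, with `type_` passed as `db_type`.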

In [0]:
def print_driver_requirements(db_type):
    """Print the driver requirements for db_type and install its Python package via %pip."""
    db_type = db_type.strip().lower()
    if db_type == "mysql":
        print("For MySQL, you must install the JDBC driver JAR (e.g., mysql-connector-java-8.0.33.jar) and also run: %pip install mysql-connector-python")
        %pip install mysql-connector-python
    elif db_type == "mariadb":
        print("For MariaDB, you must install the JDBC driver JAR (e.g., mariadb-java-client-3.3.2.jar) and also run: %pip install mariadb")
        %pip install mariadb
    elif db_type == "oracle":
        print("For Oracle, you must install the JDBC driver JAR (e.g., ojdbc8.jar) and also run: %pip install cx_Oracle")
        %pip install cx_Oracle
    elif db_type == "sqlserver":
        print("For SQL Server, you must install the JDBC driver JAR (e.g., mssql-jdbc-12.4.2.jre8.jar) and also run: %pip install pyodbc")
        %pip install pyodbc
    elif db_type == "postgresql":
        print("For PostgreSQL, you only need: %pip install psycopg2-binary")
        %pip install psycopg2-binary
    elif db_type == "snowflake":
        print("For Snowflake, you only need: %pip install snowflake-connector-python")
        %pip install snowflake-connector-python
    elif db_type == "bigquery":
        print("For BigQuery, you only need: %pip install google-cloud-bigquery")
        %pip install google-cloud-bigquery
    elif db_type == "redshift":
        print("For Redshift, you only need: %pip install redshift-connector")
        %pip install redshift-connector
    elif db_type == "db2":
        print("For IBM DB2, you only need: %pip install ibm-db")
        %pip install ibm-db
    elif db_type == "hana":
        print("For SAP HANA, you only need: %pip install hdbcli")
        %pip install hdbcli
    else:
        print(f"Unsupported RDBMS type: {db_type}")

print_driver_requirements(type_)

For MySQL, you must install the JDBC driver JAR (e.g., mysql-connector-java-8.0.33.jar) and also run: %pip install mysql-connector-python
Collecting mysql-connector-python
  Downloading mysql_connector_python-9.4.0-cp311-cp311-manylinux_2_28_aarch64.whl.metadata (7.3 kB)
Downloading mysql_connector_python-9.4.0-cp311-cp311-manylinux_2_28_aarch64.whl (33.5 MB)
Installing collected packages: mysql-connector-python
Successfully installed mysql-connector-python-9.4.0
Note: you may need to restart the kernel using %restart_python or dbutils.library.restartPython() to use updated packages.


In [0]:
# Explicit schema for new connection rows, so that port is stored as an integer.
# It has its own name so it does not overwrite the source `schema` widget value.
connection_schema = StructType([
    StructField("connection_id", StringType(), True),
    StructField("type", StringType(), True),
    StructField("host", StringType(), True),
    StructField("port", IntegerType(), True),
    StructField("database", StringType(), True),
    StructField("schema", StringType(), True),
    StructField("username", StringType(), True),
    StructField("password", StringType(), True),
    StructField("options", StringType(), True),
])

# Compare with a column expression instead of interpolating the ID into a SQL string.
existing_rows = (
    spark.table("workspace.default.connection_metadata")
    .filter(col("connection_id") == connection_id)
    .limit(1)
    .collect()
)
if existing_rows:
    # The connection is already registered: stored values override the widget inputs.
    row = existing_rows[0]
    type_ = row['type']
    host = row['host']
    port = row['port']
    database = row['database']
    schema = row['schema']
    username = row['username']
    password = row['password']
    options = row['options']
    print(f"Connection already exists: {connection_id}")
else:
    # Register the new connection; the tuple order matches connection_schema.
    new_df = spark.createDataFrame(
        [(connection_id, type_, host, port, database, schema, username, password, options)],
        connection_schema,
    )
    new_df.write.format("delta").mode("append").saveAsTable("workspace.default.connection_metadata")

In [0]:
result = dbutils.notebook.run("metadata_tables_ddl", 600)
display(result)

In [0]:
print(connection_schema)

StructType([StructField('connection_id', StringType(), True), StructField('type', StringType(), True), StructField('host', StringType(), True), StructField('port', IntegerType(), True), StructField('database', StringType(), True), StructField('schema', StringType(), True), StructField('username', StringType(), True), StructField('password', StringType(), True), StructField('options', StringType(), True)])


In [0]:
if spark.sql(f"SELECT 1 FROM workspace.default.table_metadata WHERE connection_id = '{connection_id}' LIMIT 1").count() == 0:
    params = {
        "connection_id": connection_id,
        "schema_name": schema_name,
        "target_db": target_db,
        "target_prefix": target_prefix,
        "target_suffix": target_suffix,
        "target_path_type": target_path_type,
        "target_container": target_container,
        "target_account": target_account,
        "target_bucket": target_bucket,
        "target_mount": target_mount,
    }
    result = dbutils.notebook.run("metadata_population_orchestration", 600, params)
    display(result)

---------------------------------------------------------------------------
Py4JJavaError                             Traceback (most recent call last)
File <command-8168839338265294>, line 14
      1 if spark.sql(f"SELECT 1 FROM workspace.default.table_metadata WHERE connection_id = '{connection_id}' LIMIT 1").count() == 0:
      2     params = {
      3         "connection_id": connection_id,
      4     "schema_name": schema_name,
   (...)
     12     "target_mount": target_mount

In [0]:
df_table_metadata = (
    spark.table("workspace.default.table_metadata")
    .filter(col("connection_id") == connection_id)
)
#display(df_table_metadata)

# Keep only the columns that belong to this connection's tables. A left-semi join
# avoids building a SQL IN (...) list, which is invalid when no tables are registered.
df_column_metadata = spark.table("workspace.default.column_metadata").join(
    df_table_metadata.select("table_id").distinct(), on="table_id", how="left_semi"
)

# Count columns per table and attach the table name for display.
df_table_col_count = (
    df_column_metadata.groupBy("table_id")
    .count()
    .join(df_table_metadata.select("table_id", "table_name"), on="table_id", how="left")
    .select("table_name", "count")
    .orderBy("table_name")
)

display(df_table_col_count)

table_name,count
Album,3
Artist,2
Customer,13
Employee,15
Genre,2
Invoice,9
InvoiceLine,5
MediaType,2
Playlist,2
PlaylistTrack,2


In [0]:
if cron_expression:
    params = {
        "connection_id": connection_id,
        "cron_expression": cron_expression,
        "ingest_mode": "schedule"
    }
    result = dbutils.notebook.run("ingestion_orchestration", 600, params)
    display(result)

In [0]:
import uuid
run_id = str(uuid.uuid4())
print(run_id)

c584ec4a-555d-4933-b0df-ccdef47ca76d


In [0]:
%sql
/*update workspace.default.table_metadata 
set onboarded_flag = 'N' 
where connection_id = 'conn_postgres_chinook'*/


---------------------------------------------------------------------------
_MultiThreadedRendezvous                  Traceback (most recent call last)
File /databricks/python/lib/python3.11/site-packages/pyspark/sql/connect/client/reattach.py:172, in ExecutePlanResponseReattachableIterator._has_next(self, is_last)
    171 try:
--> 172     self._current = self._call_iter(
    173         lambda: next(self._iterator)  # type: ignore[arg-type]
    174     )
    175 except StopIteration:

File /databricks/python/lib/python3.11/site-packages/pyspark/sql/connect/client/reattach.py:297, in ExecutePlanResponseReattachableIterator._call_iter

In [0]:

params = {
    "connection_id": connection_id,
    "ingest_mode": "immediate",
    "run_id": run_id,
}

result = dbutils.notebook.run("ingestion_orchestration", 600, params)
display(result)
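
The scheduled and immediate runs differ only in the parameters passed to the `ingestion_orchestration` notebook. As a sketch, both parameter sets can come from one helper. `build_ingestion_params` is a hypothetical name, and the keys are the ones used in the cells above.

```python
import uuid


def build_ingestion_params(connection_id, cron_expression="", run_id=None):
    """Build the parameter dictionary for a scheduled or an immediate run."""
    params = {"connection_id": connection_id}
    if cron_expression:
        # Scheduled mode: the cron expression is passed on to the orchestration notebook.
        params["cron_expression"] = cron_expression
        params["ingest_mode"] = "schedule"
    else:
        # Immediate mode: every run gets its own identifier for the run logs.
        params["ingest_mode"] = "immediate"
        params["run_id"] = run_id or str(uuid.uuid4())
    return params


print(build_ingestion_params("conn_postgres_chinook", cron_expression="0 0 * * *"))
print(build_ingestion_params("conn_postgres_chinook")["ingest_mode"])
```

In the notebook, the returned dictionary would be passed as the third argument of `dbutils.notebook.run("ingestion_orchestration", 600, params)`. The cron expression shown is a placeholder, since the format expected by the orchestration notebook is not stated here.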

In [0]:
%sql
select etl_run_logs.* from run_metadata
join etl_run_logs on etl_run_logs.run_id=run_metadata.run_id
where run_metadata.run_id='5c85f060-baa1-496b-a006-ea5bac4e00fe'
--and status = 'COMPLETED'
--and error != 'None'
--and status='COMPLETED'
--and table_name='albums'

transaction_id,run_id,connection_id,table_id,table_name,dateandtime,mode,status,error,target_table,number_of_rows,other_comments
26f989e7-4a75-410e-8762-38df7b6fc0f3,5c85f060-baa1-496b-a006-ea5bac4e00fe,conn_postgres_chinook,conn_postgres_chinook_lego_inventory_sets,lego_inventory_sets,2025-08-04 11:48:56,overwrite,INPROGRESS,,default.stg_lego_inventory_sets,0,
26f989e7-4a75-410e-8762-38df7b6fc0f3,5c85f060-baa1-496b-a006-ea5bac4e00fe,conn_postgres_chinook,conn_postgres_chinook_lego_inventory_sets,lego_inventory_sets,2025-08-04 11:49:02,overwrite,COMPLETED,,default.stg_lego_inventory_sets,2846,
61c9efa1-e227-42f1-9129-56f6eea5877f,5c85f060-baa1-496b-a006-ea5bac4e00fe,conn_postgres_chinook,conn_postgres_chinook_lego_part_categories,lego_part_categories,2025-08-04 11:49:04,overwrite,INPROGRESS,,default.stg_lego_part_categories,0,
61c9efa1-e227-42f1-9129-56f6eea5877f,5c85f060-baa1-496b-a006-ea5bac4e00fe,conn_postgres_chinook,conn_postgres_chinook_lego_part_categories,lego_part_categories,2025-08-04 11:49:08,overwrite,COMPLETED,,default.stg_lego_part_categories,57,
265269dd-58cc-4d5d-bd7a-f49e0e96df34,5c85f060-baa1-496b-a006-ea5bac4e00fe,conn_postgres_chinook,conn_postgres_chinook_lego_colors,lego_colors,2025-08-04 11:49:10,overwrite,INPROGRESS,,default.stg_lego_colors,0,
265269dd-58cc-4d5d-bd7a-f49e0e96df34,5c85f060-baa1-496b-a006-ea5bac4e00fe,conn_postgres_chinook,conn_postgres_chinook_lego_colors,lego_colors,2025-08-04 11:49:14,overwrite,COMPLETED,,default.stg_lego_colors,135,
bab28eb5-eabf-47ee-8257-8259d14f8b4b,5c85f060-baa1-496b-a006-ea5bac4e00fe,conn_postgres_chinook,conn_postgres_chinook_lego_inventories,lego_inventories,2025-08-04 11:49:16,overwrite,INPROGRESS,,default.stg_lego_inventories,0,
bab28eb5-eabf-47ee-8257-8259d14f8b4b,5c85f060-baa1-496b-a006-ea5bac4e00fe,conn_postgres_chinook,conn_postgres_chinook_lego_inventories,lego_inventories,2025-08-04 11:49:20,overwrite,COMPLETED,,default.stg_lego_inventories,11681,
0b7ea3cf-0561-41d3-a365-f7eb77c576d3,5c85f060-baa1-496b-a006-ea5bac4e00fe,conn_postgres_chinook,conn_postgres_chinook_lego_inventory_parts,lego_inventory_parts,2025-08-04 11:49:21,overwrite,INPROGRESS,,default.stg_lego_inventory_parts,0,
0b7ea3cf-0561-41d3-a365-f7eb77c576d3,5c85f060-baa1-496b-a006-ea5bac4e00fe,conn_postgres_chinook,conn_postgres_chinook_lego_inventory_parts,lego_inventory_parts,2025-08-04 11:49:26,overwrite,COMPLETED,,default.stg_lego_inventory_parts,580251,
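
Each table appears twice in the log output above, once as INPROGRESS and once with its final status. A short sketch of how such an export could be reduced to one line per table is shown below. `final_status_per_table` is a hypothetical helper, and `SAMPLE_LOG` copies four rows from the output above.

```python
import csv
import io

# Four rows copied from the query output above, with the same header.
SAMPLE_LOG = """transaction_id,run_id,connection_id,table_id,table_name,dateandtime,mode,status,error,target_table,number_of_rows,other_comments
26f989e7-4a75-410e-8762-38df7b6fc0f3,5c85f060-baa1-496b-a006-ea5bac4e00fe,conn_postgres_chinook,conn_postgres_chinook_lego_inventory_sets,lego_inventory_sets,2025-08-04 11:48:56,overwrite,INPROGRESS,,default.stg_lego_inventory_sets,0,
26f989e7-4a75-410e-8762-38df7b6fc0f3,5c85f060-baa1-496b-a006-ea5bac4e00fe,conn_postgres_chinook,conn_postgres_chinook_lego_inventory_sets,lego_inventory_sets,2025-08-04 11:49:02,overwrite,COMPLETED,,default.stg_lego_inventory_sets,2846,
61c9efa1-e227-42f1-9129-56f6eea5877f,5c85f060-baa1-496b-a006-ea5bac4e00fe,conn_postgres_chinook,conn_postgres_chinook_lego_part_categories,lego_part_categories,2025-08-04 11:49:04,overwrite,INPROGRESS,,default.stg_lego_part_categories,0,
61c9efa1-e227-42f1-9129-56f6eea5877f,5c85f060-baa1-496b-a006-ea5bac4e00fe,conn_postgres_chinook,conn_postgres_chinook_lego_part_categories,lego_part_categories,2025-08-04 11:49:08,overwrite,COMPLETED,,default.stg_lego_part_categories,57,
"""


def final_status_per_table(csv_text):
    """Return {table_name: (status, number_of_rows)} using the latest log row per table."""
    latest = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        name = row["table_name"]
        # Timestamps in 'YYYY-MM-DD HH:MM:SS' form sort correctly as plain strings.
        if name not in latest or row["dateandtime"] >= latest[name]["dateandtime"]:
            latest[name] = row
    return {name: (r["status"], int(r["number_of_rows"])) for name, r in latest.items()}


for table, (status, rows) in final_status_per_table(SAMPLE_LOG).items():
    print(table, status, rows)
```

A table whose latest row is still INPROGRESS would stand out in this summary, which is one way to spot loads that did not finish.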
