 This notebook orchestrates the population of metadata tables required for metadata-driven data ingestion workflows.
 
 **Key Steps:**
 1. **Metadata Table Management:**  
    - Checks for the existence of required metadata tables (`connection_metadata`, `table_metadata`).
    - Provides instructions if tables are missing.
 
 2. **MetadataManager Class:**  
    - Encapsulates logic to load metadata tables as Spark DataFrames.
 
 3. **Connection Listing:**  
    - Loads available connections from `connection_metadata`.
    - Displays connection details for user selection.
 
 4. **User Input via Widgets:**  
    - Collects user input for connection selection, schema, target database, and storage details using Databricks widgets.
 
 5. **Connection Selection:**  
    - Filters the connections based on user input.
    - Sets the schema context for metadata population.
 
 6. **Metadata Auto-Population:**  
    - Triggers a separate notebook (`metadata_auto_populate`) to populate metadata tables using the provided parameters.
 
 7. **Completion and Next Steps:**  
    - Informs the user upon completion.
    - Guides the user to proceed with metadata-driven ingestion using the populated metadata.
 
 This notebook is intended for data engineers and analysts to set up and manage metadata required for automated data ingestion pipelines in Databricks.

%md
✅ **If `target_path_type` is empty or any value other than `abfss`, `s3`, or `dbfs` (default local mount):**
The system defaults to a general path under `/mnt/datalake`.

You must provide:
- `tgt_tbl_name`: The target table or file name.
- (Optional) `target_db`: Used as a folder inside `/mnt/datalake`.

Resulting path example:
`/mnt/datalake/<target_db>/<tgt_tbl_name>`

✅ **If `target_path_type` is `abfss` (Azure Data Lake Storage Gen2):**
This format is used for writing to Azure Storage accounts with a hierarchical namespace.

You must provide:
- `target_container`: The container name inside your Azure storage account.
- `target_account`: The Azure storage account name.
- `tgt_tbl_name`: The target table or file name.
- (Optional) `target_db`: If provided, it is used as a folder inside the container.

Resulting path example:
`abfss://<target_container>@<target_account>.dfs.core.windows.net/<target_db>/<tgt_tbl_name>`

✅ **If `target_path_type` is `s3` (Amazon S3):**
This format is used for writing to AWS S3 buckets.

You must provide:
- `target_bucket`: Your S3 bucket name.
- `tgt_tbl_name`: The target table or file name.
- (Optional) `target_db`: If provided, it is used as a folder in the bucket.

Resulting path example:
`s3://<target_bucket>/<target_db>/<tgt_tbl_name>`

✅ **If `target_path_type` is `dbfs` (Databricks File System):**
This format is used for writing files under a mounted path in the Databricks File System.

You must provide:
- `target_mount`: The DBFS mount point.
- `tgt_tbl_name`: The target table or file name.
- (Optional) `target_db`: Used as a folder inside the mount.

Resulting path example:
`/dbfs/mnt/<target_mount>/<target_db>/<tgt_tbl_name>`
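
The path rules above can be sketched as a single helper. This is a minimal illustration, not the code used by `metadata_auto_populate`: the function name `build_target_path` is hypothetical, and lowercasing the path type, skipping an empty `target_db`, and raising `ValueError` for missing values are assumptions made for the example.

```python
def build_target_path(target_path_type, tgt_tbl_name, target_db="",
                      target_container="", target_account="",
                      target_bucket="", target_mount=""):
    """Build a target path from the rules described above (illustrative only)."""
    if not tgt_tbl_name:
        raise ValueError("tgt_tbl_name is required")

    # Assumption: the path type is compared case-insensitively.
    path_type = (target_path_type or "").strip().lower()

    if path_type == "abfss":
        if not (target_container and target_account):
            raise ValueError("abfss requires target_container and target_account")
        root = f"abfss://{target_container}@{target_account}.dfs.core.windows.net"
    elif path_type == "s3":
        if not target_bucket:
            raise ValueError("s3 requires target_bucket")
        root = f"s3://{target_bucket}"
    elif path_type == "dbfs":
        if not target_mount:
            raise ValueError("dbfs requires target_mount")
        root = f"/dbfs/mnt/{target_mount}"
    else:
        # Any other value falls back to the default local mount.
        root = "/mnt/datalake"

    # target_db is optional, so empty segments are skipped.
    parts = [root] + [p for p in (target_db, tgt_tbl_name) if p]
    return "/".join(parts)


print(build_target_path("s3", "orders", target_db="sales", target_bucket="my-bucket"))
print(build_target_path("", "orders"))
```

Keeping the fallback in the final `else` branch mirrors the "anything else" rule: an unrecognized or empty `target_path_type` never fails, it simply resolves under `/mnt/datalake`.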



In [0]:
# Databricks notebook: Metadata Population Orchestration
from pyspark.sql import SparkSession




In [0]:
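class MetadataManager:
    """Loads the metadata tables as Spark DataFrames."""

    def __init__(self, spark: SparkSession):
        self.spark = spark

    def load_connection_metadata(self):
        """Return the `connection_metadata` table as a DataFrame."""
        return self.spark.table('connection_metadata')

    def load_table_metadata(self):
        """Return the `table_metadata` table as a DataFrame."""
        return self.spark.table('table_metadata')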

In [0]:
spark = SparkSession.builder.getOrCreate()
metadata = MetadataManager(spark)

print(metadata)

<__main__.MetadataManager object at 0xffde9b429910>


In [0]:
# 1. Check if metadata tables exist
#def table_exists(table_name):
#    return table_name in [row.name for row in spark.catalog.listTables()]

#required_tables = ["connection_metadata", "table_metadata"]
#missing = [t for t in required_tables if not table_exists(t)]
#if missing:
#    print(f"Missing required tables: {missing}. Please run metadata_tables_ddl.sql first.")
#    #dbutils.notebook.exit("Missing tables")

#print(required_tables)
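
The existence check in the cell above is commented out. Its core logic can be sketched as a pure function that is easy to test without a cluster. The name `find_missing_tables` is hypothetical, and the case-insensitive comparison is an assumption. In the notebook, the list of existing names could come from `[row.name for row in spark.catalog.listTables()]`.

```python
REQUIRED_TABLES = ["connection_metadata", "table_metadata"]


def find_missing_tables(existing_tables, required_tables=REQUIRED_TABLES):
    """Return the required tables that are absent, in their original order."""
    # Assumption: table names are compared case-insensitively.
    existing = {name.lower() for name in existing_tables}
    return [t for t in required_tables if t.lower() not in existing]


missing = find_missing_tables(["connection_metadata", "some_other_table"])
if missing:
    print(f"Missing required tables: {missing}. Please run metadata_tables_ddl.sql first.")
```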



In [0]:
# 2. List available connections
conn_df = metadata.load_connection_metadata().withColumnRenamed("schema", "schema_name")
#connections = conn_df.collect()
#print("Available connections:")
#for idx, row in enumerate(connections):
#    print(f"{idx+1}. {row.connection_id} ({row.type} @ {row.host})")

In [0]:
# Widgets for connection_id, schema_name, target_db, prefix, suffix, path_type, and storage details
dbutils.widgets.text("connection_id", "", "Connection ID")
dbutils.widgets.text("schema_name", "", "Schema Name")
dbutils.widgets.text("target_db", "", "Target Database (Databricks, abfss, s3, dbfs)")
dbutils.widgets.text("target_prefix", "", "Target Table Prefix (optional)")
dbutils.widgets.text("target_suffix", "", "Target Table Suffix (optional)")
dbutils.widgets.text("target_path_type", "", "Target Path Type (optional)")
dbutils.widgets.text("target_container", "", "Target Container (for abfss)")
dbutils.widgets.text("target_account", "", "Target Account (for abfss)")
dbutils.widgets.text("target_bucket", "", "Target Bucket (for s3)")
dbutils.widgets.text("target_mount", "", "Target Mount (for dbfs)")

connection_id = dbutils.widgets.get("connection_id")
schema_name = dbutils.widgets.get("schema_name")
target_db = dbutils.widgets.get("target_db")
target_prefix = dbutils.widgets.get("target_prefix")
target_suffix = dbutils.widgets.get("target_suffix")
target_path_type = dbutils.widgets.get("target_path_type")
target_container = dbutils.widgets.get("target_container")
target_account = dbutils.widgets.get("target_account")
target_bucket = dbutils.widgets.get("target_bucket")
target_mount = dbutils.widgets.get("target_mount")
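
Because every widget defaults to an empty string, it may help to validate the collected values before using them. The sketch below is one possible approach, not part of the original notebook: `missing_widget_values` and `REQUIRED_BY_PATH_TYPE` are hypothetical names, and treating `connection_id` as mandatory is an assumption based on the later filter on that column.

```python
# Storage details required for each path type, per the rules in the markdown cell.
REQUIRED_BY_PATH_TYPE = {
    "abfss": ["target_container", "target_account"],
    "s3": ["target_bucket"],
    "dbfs": ["target_mount"],
}


def missing_widget_values(values):
    """Return the names of required widget values that are empty."""
    required = ["connection_id"]  # assumption: always needed to select a connection
    path_type = values.get("target_path_type", "").strip().lower()
    required += REQUIRED_BY_PATH_TYPE.get(path_type, [])
    return [name for name in required if not values.get(name, "").strip()]


example = {
    "connection_id": "conn_postgres_chinook",
    "target_path_type": "abfss",
    "target_container": "raw",
    "target_account": "",
}
print(missing_widget_values(example))
```

In the notebook, the `values` dictionary could be assembled from the `dbutils.widgets.get(...)` results collected above.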

In [0]:
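# 3. Select the connection chosen via the connection_id widget
matches = conn_df.filter(conn_df["connection_id"] == connection_id).collect()
if not matches:
    # Fail with a clear message instead of an IndexError on an empty result
    raise ValueError(f"No connection found for connection_id '{connection_id}'")
selected_conn = matches[0]
print(f"Selected connection: {selected_conn.connection_id} ({selected_conn.type})")
# Use the schema from the widget if provided; otherwise fall back to the connection's schema
schema_name = schema_name or selected_conn.schema_name
print(f"Schema: {schema_name}")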

Selected connection: conn_postgres_chinook (postgresql)
Schema: public


In [0]:
# 4. Trigger the auto-populate notebook (timeout: 600 seconds)
result = dbutils.notebook.run("metadata_auto_populate", 600, {
    "connection_id": selected_conn.connection_id,
    "schema_name": schema_name,
    "target_db": target_db,
    "target_prefix": target_prefix,
    "target_suffix": target_suffix,
    "target_path_type": target_path_type,
    "target_container": target_container,
    "target_account": target_account,
    "target_bucket": target_bucket,
    "target_mount": target_mount
})
print(result)

None
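
The run above prints `None`, so the caller cannot tell what the child notebook did. `dbutils.notebook.run` returns whatever string the child passes to `dbutils.notebook.exit`. If `metadata_auto_populate` were changed to exit with a JSON string (an assumption, the source does not show this), the result could be interpreted roughly as follows. The helper name `parse_notebook_result` and the `status` and `detail` keys are hypothetical.

```python
import json


def parse_notebook_result(result):
    """Interpret the string returned by a child notebook run (illustrative only)."""
    text = "" if result is None else str(result).strip()
    if not text or text == "None":
        return {"status": "unknown", "detail": "child notebook returned no exit value"}
    try:
        payload = json.loads(text)
    except ValueError:
        # Not JSON: keep the raw text so it can still be logged.
        return {"status": "unknown", "detail": text}
    if isinstance(payload, dict):
        return payload
    return {"status": "unknown", "detail": payload}


print(parse_notebook_result(None))
print(parse_notebook_result('{"status": "ok", "tables": 11}'))
```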


In [0]:
# 5. Show summary and next steps
print("\nMetadata population complete.")
print("You may now proceed to run metadata_driven_ingestion.py to ingest data from the populated tables.")

