%md
# Silver Transformation — Gulf to Bay Databricks

This notebook performs the Silver‑layer refinement for the Gulf to Bay modernization pipeline. Silver tables apply deterministic cleaning, type normalization, deduplication, and business‑ready shaping on top of the raw Bronze Delta tables. The goal is to produce analytics‑ready, conformed datasets that support downstream Gold modeling and semantic layer construction.

The transformations in this notebook preserve source fidelity while enforcing consistent schema standards across sales, customers, and products. Each Silver table is written as a managed Delta table in Unity Catalog, and a sample of the sales output is previewed at the end to confirm schema and types.

In [None]:
# ============================================================
#  SILVER CONFIG — Paths, Table Names, and Setup
#  - Defines catalog/schema locations
#  - Establishes Bronze input tables
#  - Establishes Silver output tables
#  - Keeps all pathing centralized for maintainability
# ============================================================

from pyspark.sql import functions as F
from pyspark.sql import types as T

# Unity Catalog locations
CATALOG = "gulf_to_bay_databricks"
SCHEMA = "default"

# Bronze source tables (input)
BRONZE = {
    "sales":     f"{CATALOG}.{SCHEMA}.bronze_sales",
    "customers": f"{CATALOG}.{SCHEMA}.bronze_customers",
    "products":  f"{CATALOG}.{SCHEMA}.bronze_products"
}

# Silver target tables (output)
SILVER = {
    "sales":     f"{CATALOG}.{SCHEMA}.sales_silver",
    "customers": f"{CATALOG}.{SCHEMA}.customers_silver",
    "products":  f"{CATALOG}.{SCHEMA}.products_silver"
}
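
%md
Because every table name is built from `CATALOG` and `SCHEMA`, a typo in either value affects all six entries at once. The following is a minimal sketch of a pre-flight check, assuming plain unquoted identifiers (letters, digits, underscores). The helper names `is_three_part_name` and `invalid_table_names` are hypothetical and not part of the original notebook.

```python
import re

# Assumption: identifiers are unquoted and contain only letters, digits, underscores.
_IDENT = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def is_three_part_name(name: str) -> bool:
    """Return True if `name` looks like catalog.schema.table with plain identifiers."""
    parts = name.split(".")
    return len(parts) == 3 and all(_IDENT.match(p) for p in parts)


def invalid_table_names(tables: dict) -> dict:
    """Return the entries of a {key: table_name} mapping that fail the check."""
    return {key: name for key, name in tables.items() if not is_three_part_name(name)}
```

Calling `invalid_table_names(BRONZE)` and `invalid_table_names(SILVER)` right after the config cell should return empty dictionaries when the names are well formed.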

In [None]:
# ============================================================
#  LOAD BRONZE TABLES
#  - Reads raw Bronze Delta tables from Unity Catalog
#  - No transformations applied here
#  - Ensures all downstream logic uses consistent DataFrames
# ============================================================

df_sales_bronze     = spark.table(BRONZE["sales"])
df_customers_bronze = spark.table(BRONZE["customers"])
df_products_bronze  = spark.table(BRONZE["products"])

print("Bronze tables loaded successfully.")

In [None]:
# ============================================================
#  CLEANUP & TYPE NORMALIZATION
#  - Standardizes column names (lowercase, trimmed)
#  - Trims whitespace from all string columns
#  - Removes exact duplicate rows (rows identical in every column)
#  - Produces a clean, predictable schema for Silver
# ============================================================

def normalize_columns(df):
    # Standardize column names to lowercase with no leading/trailing spaces,
    # renaming every column in a single pass
    return df.toDF(*[c.strip().lower() for c in df.columns])

def clean_string_columns(df):
    # Trim whitespace from every string column in a single projection;
    # non-string columns pass through unchanged
    string_cols = {f.name for f in df.schema.fields if isinstance(f.dataType, T.StringType)}
    return df.select([
        F.trim(F.col(c)).alias(c) if c in string_cols else F.col(c)
        for c in df.columns
    ])

def apply_basic_cleanup(df):
    # Apply all cleanup steps in a deterministic order
    df = normalize_columns(df)
    df = clean_string_columns(df)
    # dropDuplicates() with no arguments removes only rows that match on every column
    df = df.dropDuplicates()
    return df

# Apply cleanup to each Bronze dataset
sales_clean     = apply_basic_cleanup(df_sales_bronze)
customers_clean = apply_basic_cleanup(df_customers_bronze)
products_clean  = apply_basic_cleanup(df_products_bronze)
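
%md
Lowercasing and trimming column names can make two distinct source columns collapse to the same name (for example `"Order_ID"` and `" order_id"`), which would produce ambiguous columns downstream. The sketch below works on a plain list of names, such as `df.columns`, and reports any collisions before the rename is applied. The function name `find_name_collisions` is hypothetical and not part of the original notebook.

```python
from collections import defaultdict


def find_name_collisions(columns):
    """Group original column names by their normalized form (strip + lower)
    and return only the groups that contain more than one original name."""
    groups = defaultdict(list)
    for name in columns:
        groups[name.strip().lower()].append(name)
    return {norm: names for norm, names in groups.items() if len(names) > 1}
```

For example, `find_name_collisions(df_sales_bronze.columns)` could be checked before calling `apply_basic_cleanup`; an empty dictionary means the normalized names are unique.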

In [None]:
# ============================================================
#  BUSINESS RULES & TRANSFORMATIONS
#  - Applies domain-specific logic
#  - Converts date columns where applicable
#  - Placeholder for additional business rules as needed
# ============================================================

def convert_dates(df, date_cols):
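    # Cast the listed columns to DateType when they exist; missing columns are skipped.
    # to_date() is called without a format, so values must already be in a form Spark
    # can cast to a date (for example yyyy-MM-dd). Unparseable values become NULL,
    # or raise an error when ANSI mode is enabled.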
    for col in date_cols:
        if col in df.columns:
            df = df.withColumn(col, F.to_date(F.col(col)))
    return df

# Apply business rules to each dataset
# (customers and products currently pass through unchanged)
sales_silver     = convert_dates(sales_clean, ["order_date", "ship_date"])
customers_silver = customers_clean
products_silver  = products_clean

In [None]:
# ============================================================
#  WRITE SILVER TABLES TO UNITY CATALOG
#  - Writes cleaned, conformed datasets as Delta tables
#  - Overwrites existing Silver tables so that reruns are idempotent
#  - Ensures downstream Gold modeling uses stable inputs
# ============================================================

# Map each Silver key to its DataFrame so every table is written the same way
silver_outputs = {
    "sales":     sales_silver,
    "customers": customers_silver,
    "products":  products_silver,
}

for name, df in silver_outputs.items():
    (
        df.write
        .format("delta")
        .mode("overwrite")
        .saveAsTable(SILVER[name])
    )

print("Silver tables written successfully.")
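
%md
Since the cleanup step drops exact duplicate rows, comparing row counts before and after the write is a quick way to see how many rows were removed per dataset. The sketch below assumes the counts have already been collected into plain dictionaries, for example with `spark.table(BRONZE[name]).count()` and `spark.table(SILVER[name]).count()`. The function name `dedup_summary` is hypothetical.

```python
def dedup_summary(bronze_counts, silver_counts):
    """Return per-dataset row counts and rows removed, given {name: row_count} mappings.

    Assumes both mappings contain the same dataset names.
    """
    summary = {}
    for name, before in bronze_counts.items():
        after = silver_counts[name]
        summary[name] = {"bronze": before, "silver": after, "removed": before - after}
    return summary
```

A `removed` value of zero means no exact duplicates were found; an unexpectedly large value may be worth investigating in the Bronze source.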

In [None]:
# ============================================================
#  VALIDATION PREVIEW — SAMPLE SILVER OUTPUT
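#  - Reads the Silver sales table back from Unity Catalog
#  - Prints its schema and displays a small sample
#  - Confirms schema, types, and transformations as written
# ============================================================

sales_silver_table = spark.table(SILVER["sales"])
sales_silver_table.printSchema()
display(sales_silver_table.limit(20))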
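
%md
A visual preview can be complemented by a programmatic schema check. The sketch below compares `(name, type)` pairs, in the form returned by `DataFrame.dtypes`, against an expected mapping. The function name `schema_mismatches` and the `EXPECTED_SALES_TYPES` mapping are hypothetical; only the two date columns are taken from this notebook, and any further expectations would need to be filled in from the actual Silver schema.

```python
def schema_mismatches(actual_dtypes, expected):
    """Compare (name, type) pairs against an expected {column: type} mapping.

    Returns a dict keyed by column name describing each missing or mistyped column.
    """
    actual = dict(actual_dtypes)
    problems = {}
    for col, want in expected.items():
        if col not in actual:
            problems[col] = "missing"
        elif actual[col] != want:
            problems[col] = f"expected {want}, found {actual[col]}"
    return problems


# Assumption: illustrative expectation covering only the converted date columns.
EXPECTED_SALES_TYPES = {"order_date": "date", "ship_date": "date"}
```

One possible usage is `schema_mismatches(spark.table(SILVER["sales"]).dtypes, EXPECTED_SALES_TYPES)`, where an empty dictionary indicates that the expected columns are present with the expected types.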