# Database & Table Initialization

## 1. Overview
In this notebook, we will lay the foundation for our Data Lakehouse. We will:
1.  Initialize the **Spark Session** with Delta Lake support.
2.  Create the necessary **Databases** (Layers).
    *   `edw_ld`: Landing Layer (Raw Data)
    *   `edw_stg`: Staging Layer (Intermediate Processing)
    *   `edw`: Enterprise Data Warehouse (Final Consumption)
3.  Create the **Dimension** and **Fact** tables using Delta format.
4.  Create a **Job Control** table to track our pipeline executions.

## 2. Spark Session Setup
The Spark Session must be configured with the **Delta Lake** SQL extension and catalog, and with Hive support enabled so that database and table metadata persists in a metastore.

In [None]:
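from pyspark.sql import SparkSession

def get_spark_session(app_name="Init_Database"):
    """
    Creates (or reuses) a Spark Session with Delta Lake configurations.

    Assumes the Delta Lake JARs are already available on the Spark classpath.
    """
    spark = (
        SparkSession.builder
        .appName(app_name)
        # Register Delta Lake's SQL extension.
        .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
        # Route the default catalog through Delta so that USING DELTA tables resolve.
        .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
        # Keep database and table metadata in a Hive metastore.
        .enableHiveSupport()
        .getOrCreate()
    )
    return spark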

# Initialize Spark
spark = get_spark_session()
print(f"Spark Version: {spark.version}")
print(f"Spark UI: {spark.sparkContext.uiWebUrl}")
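
As a quick sanity check, the two Delta settings can be read back from the running session. The sketch below is a minimal, hypothetical helper (`missing_delta_settings` is not part of Spark or Delta Lake). It accepts any function that looks up a config key, so it runs here against a plain dict; with a real session it could be called as `missing_delta_settings(lambda k: spark.conf.get(k, None))`.

```python
# Settings that the session builder above is expected to have applied.
EXPECTED_DELTA_CONF = {
    "spark.sql.extensions": "io.delta.sql.DeltaSparkSessionExtension",
    "spark.sql.catalog.spark_catalog": "org.apache.spark.sql.delta.catalog.DeltaCatalog",
}


def missing_delta_settings(get_conf):
    """Return {key: actual_value} for every expected setting that is not in place.

    `get_conf` is any callable that maps a config key to its value (or None).
    """
    problems = {}
    for key, expected in EXPECTED_DELTA_CONF.items():
        actual = get_conf(key)
        # spark.sql.extensions may hold a comma-separated list, so test membership.
        values = [v.strip() for v in (actual or "").split(",")]
        if expected not in values:
            problems[key] = actual
    return problems


# Demonstration with a plain dict standing in for the session configuration.
demo_conf = {"spark.sql.extensions": "io.delta.sql.DeltaSparkSessionExtension"}
print(missing_delta_settings(demo_conf.get))
```

An empty result means both settings are present; any returned key points at a setting that did not reach the session.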

## 3. Create Database Layers

We will create three distinct databases to organize our data lifecycle:
1.  **Landing (`edw_ld`):** Where raw files are mapped initially.
2.  **Staging (`edw_stg`):** Temporary storage for transformations.
3.  **EDW (`edw`):** The final modeled data (Star Schema).

In [None]:
# Create Databases
databases = ["edw", "edw_stg", "edw_ld"]

for db in databases:
    print(f"Creating Database: {db}")
    spark.sql(f"CREATE DATABASE IF NOT EXISTS {db}")

# Verify creation
spark.sql("SHOW DATABASES").show()

## 4. Create Dimension Tables (`edw` layer)

We will now create the tables in the `edw` database. Because the `CREATE TABLE` statements specify no `LOCATION`, these are **Managed Tables**: Spark manages both the metadata and the underlying Delta files.

### A. Store Dimension (`dim_store`)
Contains details about the retail store locations.

In [None]:
# Create Store Dimension
spark.sql("""
CREATE TABLE IF NOT EXISTS edw.dim_store (
    row_wid STRING,
    store_id STRING,
    store_name STRING,
    address STRING,
    city STRING,
    state STRING,
    zip_code STRING,
    phone_number STRING,
    manager_name STRING,
    insert_dt TIMESTAMP,
    update_dt TIMESTAMP
) USING DELTA
""")

print("dim_store created successfully.")

### B. Product Dimension (`dim_product`)
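Contains product catalog information. The table is modeled for **SCD Type 2** (Slowly Changing Dimension): each change inserts a new row, `effective_start_dt` and `effective_end_dt` bound the period during which that row was valid, and `active_flg` marks the current version.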

In [None]:
# Create Product Dimension
spark.sql("""
CREATE TABLE IF NOT EXISTS edw.dim_product (
    row_wid STRING,
    product_id STRING,
    product_name STRING,
    brand STRING,
    category STRING,
    unit_price DOUBLE,
    size STRING,
    uom STRING,
    image_url STRING,
    effective_start_dt TIMESTAMP,
    effective_end_dt TIMESTAMP,
    active_flg INT,
    insert_dt TIMESTAMP,
    update_dt TIMESTAMP
) USING DELTA
""")

print("dim_product created successfully.")
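
To make the SCD Type 2 columns concrete, here is a minimal plain-Python sketch of a single change being applied to a product history. It is an illustration under stated assumptions, not the pipeline's actual load logic: `apply_scd2_change` is a hypothetical helper, an open-ended version is assumed to store `None` in `effective_end_dt`, and `row_wid` is assumed to be a freshly generated UUID string.

```python
import uuid
from datetime import datetime


def apply_scd2_change(versions, product_id, changes, as_of):
    """Expire the active version of `product_id` and append a new active one.

    `versions` is a list of dicts keyed by dim_product column names.
    """
    current = next(
        (v for v in versions
         if v["product_id"] == product_id and v["active_flg"] == 1),
        None,
    )
    base = {"product_id": product_id}
    if current is not None:
        # Close the old version instead of overwriting it, so history is kept.
        current["effective_end_dt"] = as_of
        current["active_flg"] = 0
        current["update_dt"] = as_of
        base = dict(current)
    new_version = {
        **base,
        **changes,
        "row_wid": str(uuid.uuid4()),   # assumption: surrogate key is a UUID
        "effective_start_dt": as_of,
        "effective_end_dt": None,       # assumption: None means "still valid"
        "active_flg": 1,
        "insert_dt": as_of,
        "update_dt": as_of,
    }
    versions.append(new_version)
    return new_version


start = datetime(2024, 1, 1)
history = [{
    "row_wid": "1", "product_id": "P100", "unit_price": 9.99,
    "effective_start_dt": start, "effective_end_dt": None,
    "active_flg": 1, "insert_dt": start, "update_dt": start,
}]

# A price change produces a second row; the first row is closed, not modified in value.
apply_scd2_change(history, "P100", {"unit_price": 11.49}, datetime(2024, 6, 1))
for row in history:
    print(row["unit_price"], row["effective_start_dt"], row["effective_end_dt"], row["active_flg"])
```

After the call, the history holds two rows for `P100`: the original price with a closed validity window and `active_flg = 0`, and the new price as the only active row.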

### C. Customer Dimension (`dim_customer`)
Contains customer details. This is also modeled for **SCD Type 2** to track address or email changes over time.

In [None]:
# Create Customer Dimension
spark.sql("""
CREATE TABLE IF NOT EXISTS edw.dim_customer (
    row_wid STRING,
    customer_id STRING,
    first_name STRING,
    last_name STRING,
    address STRING,
    city STRING,
    state STRING,
    zip_code STRING,
    phone_number STRING,
    email STRING,
    plan_type_id STRING,
    effective_start_dt TIMESTAMP,
    effective_end_dt TIMESTAMP,
    active_flg INT,
    insert_dt TIMESTAMP,
    update_dt TIMESTAMP
) USING DELTA
""")

print("dim_customer created successfully.")
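
One benefit of keeping versioned rows is that a customer's details can be looked up as of any point in time. The sketch below shows that lookup in plain Python. The helper name `version_as_of` and the sample rows are hypothetical, and it assumes that `effective_end_dt` is exclusive and is `None` for the current version.

```python
from datetime import datetime


def version_as_of(versions, customer_id, ts):
    """Return the version of `customer_id` that was valid at `ts`, or None."""
    for v in versions:
        if v["customer_id"] != customer_id:
            continue
        started = v["effective_start_dt"] <= ts
        # Assumption: the end of the window is exclusive; None means open-ended.
        not_ended = v["effective_end_dt"] is None or ts < v["effective_end_dt"]
        if started and not_ended:
            return v
    return None


customer_versions = [
    {"customer_id": "C1", "email": "old@example.com",
     "effective_start_dt": datetime(2023, 1, 1),
     "effective_end_dt": datetime(2024, 3, 15), "active_flg": 0},
    {"customer_id": "C1", "email": "new@example.com",
     "effective_start_dt": datetime(2024, 3, 15),
     "effective_end_dt": None, "active_flg": 1},
]

print(version_as_of(customer_versions, "C1", datetime(2023, 7, 1))["email"])
print(version_as_of(customer_versions, "C1", datetime(2024, 7, 1))["email"])
```

Treating the end of the window as exclusive means a timestamp exactly on a change boundary matches only the newer version, so the windows never overlap.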

## 5. Create Fact Table

### Sales Fact (`fact_sales`)
This table holds the transactional data. It links to the dimensions through the `store_wid`, `product_wid`, and `customer_wid` columns (`_wid` stands for Warehouse ID).

In [None]:
# Create Sales Fact Table
spark.sql("""
CREATE TABLE IF NOT EXISTS edw.fact_sales (
    row_wid STRING,
    date_id STRING,
    store_wid STRING,
    product_wid STRING,
    customer_wid STRING,
    order_id STRING,
    invoice_num STRING,
    qty INT,
    tax DOUBLE,
    line_total DOUBLE,
    insert_dt TIMESTAMP,
    update_dt TIMESTAMP
) USING DELTA
""")

print("fact_sales created successfully.")
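
A minimal sketch of how the `_wid` columns might be populated during a load. It assumes that each `_wid` column stores the `row_wid` of the current dimension row and that unmatched keys fall back to a placeholder value of `"-1"`. Both conventions, along with the helper names `build_wid_lookup` and `attach_wid`, are assumptions made for illustration.

```python
def build_wid_lookup(dim_rows, natural_key):
    """Map a natural key (e.g. product_id) to the row_wid of its current row."""
    lookup = {}
    for row in dim_rows:
        # Dimensions without active_flg (such as dim_store) count as current.
        if row.get("active_flg", 1) == 1:
            lookup[row[natural_key]] = row["row_wid"]
    return lookup


def attach_wid(record, lookup, natural_key, wid_column, unknown="-1"):
    """Return a copy of `record` with `wid_column` resolved from `lookup`."""
    resolved = dict(record)
    # Assumption: "-1" marks a sale whose dimension row could not be found.
    resolved[wid_column] = lookup.get(record[natural_key], unknown)
    return resolved


dim_product_rows = [
    {"row_wid": "10", "product_id": "P100", "active_flg": 0},
    {"row_wid": "11", "product_id": "P100", "active_flg": 1},
]
product_lookup = build_wid_lookup(dim_product_rows, "product_id")

known_sale = attach_wid({"order_id": "O1", "product_id": "P100", "qty": 2},
                        product_lookup, "product_id", "product_wid")
unknown_sale = attach_wid({"order_id": "O2", "product_id": "P999", "qty": 1},
                          product_lookup, "product_id", "product_wid")
print(known_sale["product_wid"], unknown_sale["product_wid"])
```

In a Spark pipeline the same idea would typically be expressed as a join between the incoming sales data and each dimension table rather than as an in-memory dictionary.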

## 6. Infrastructure Tables

### Job Control Table (`job_control`)
This table is crucial for our ETL pipelines. It tracks the status of data loads (e.g., "Success", "Failed") and the high-water mark (`max_timestamp`) to support incremental loading.

In [None]:
# Create Job Control Table
spark.sql("""
CREATE TABLE IF NOT EXISTS edw.job_control (
    job_id STRING,
    job_name STRING,
    status STRING,
    max_timestamp TIMESTAMP,
    rundate STRING,
    insert_dt TIMESTAMP
) USING DELTA
""")

print("job_control created successfully.")
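
The high-water mark pattern can be sketched in plain Python as follows. This is an illustration only: the helper names, the job names, and the choice of `update_dt` as the source timestamp column are assumptions, and a real pipeline would read `edw.job_control` through Spark instead of a list of dicts.

```python
from datetime import datetime


def latest_high_water_mark(runs, job_name):
    """Return the newest max_timestamp among successful runs of `job_name`."""
    marks = [
        r["max_timestamp"]
        for r in runs
        if r["job_name"] == job_name
        and r["status"] == "Success"          # failed runs must not advance the mark
        and r["max_timestamp"] is not None
    ]
    return max(marks) if marks else None


def rows_to_load(source_rows, high_water_mark, ts_column="update_dt"):
    """Keep only the rows that are newer than the high-water mark."""
    if high_water_mark is None:
        return list(source_rows)              # no successful run yet: full load
    return [r for r in source_rows if r[ts_column] > high_water_mark]


job_runs = [
    {"job_name": "load_fact_sales", "status": "Success", "max_timestamp": datetime(2024, 5, 1)},
    {"job_name": "load_fact_sales", "status": "Failed", "max_timestamp": datetime(2024, 5, 3)},
    {"job_name": "load_dim_store", "status": "Success", "max_timestamp": datetime(2024, 5, 4)},
]
source = [
    {"order_id": "O1", "update_dt": datetime(2024, 4, 30)},
    {"order_id": "O2", "update_dt": datetime(2024, 5, 2)},
]

hwm = latest_high_water_mark(job_runs, "load_fact_sales")
print(hwm, [r["order_id"] for r in rows_to_load(source, hwm)])
```

Ignoring failed runs is the key design choice: the mark advances only after a load has completed, so a rerun after a failure picks up the same rows again.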

## 7. Verification

Let's list all the tables in our `edw` database to confirm everything is set up correctly.

In [None]:
# Verify Tables
print("Tables in EDW:")
spark.sql("SHOW TABLES IN edw").show(truncate=False)
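
Beyond eyeballing the output, the check can be automated. The sketch below compares the tables this notebook creates against what the catalog reports, using `spark.catalog.listTables`. The helper `missing_tables` is hypothetical, and the small stand-in object exists only so that the sketch runs without a cluster; in the notebook, pass the real `spark` session instead.

```python
from types import SimpleNamespace

# The five tables created in sections 4 to 6 of this notebook.
EXPECTED_EDW_TABLES = {"dim_store", "dim_product", "dim_customer", "fact_sales", "job_control"}


def missing_tables(spark, database="edw", expected=EXPECTED_EDW_TABLES):
    """Return the expected table names that the catalog does not list, sorted."""
    existing = {t.name.lower() for t in spark.catalog.listTables(database)}
    return sorted(set(expected) - existing)


# Stand-in for a Spark session that only knows about three of the tables.
demo_spark = SimpleNamespace(
    catalog=SimpleNamespace(
        listTables=lambda db: [
            SimpleNamespace(name=n) for n in ("dim_store", "dim_product", "fact_sales")
        ]
    )
)
print(missing_tables(demo_spark))
```

With the real session, `missing_tables(spark)` should return an empty list once every cell above has run.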