# ETL Architecture and Loading Strategy

## 1. Introduction
In this notebook, we define the architecture for our Data Warehouse. Based on the project design, we will implement a **Multi-Hop (Medallion) Architecture** consisting of three distinct layers. This structure supports data quality checks at each hop, lets a failed pipeline be restarted without re-ingesting the source (fault tolerance), and separates raw data ingestion from business logic.

## 2. Architecture Overview

The data will flow through the following layers:

### Layer 1: Landing (Bronze)
*   **Database Name:** `edw_ld`
*   **Source:** Raw files from AWS S3 Data Lake.
*   **Load Strategy:** **Append Mode**.
*   **Data Transformation:**
    *   No transformations or calculations.
    *   No Joins.
    *   **Schema:** All columns are read as **STRING** to prevent schema mismatch errors during ingestion.
*   **Goal:** To keep a faithful copy of the source data as it arrived.
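
The Landing rules above can be sketched as a small loader. This is a minimal sketch, assuming the raw files are CSV with a header row (the source does not state the file format); `landing_read_options` and `load_to_landing` are hypothetical helper names, not part of the project code.

```python
def landing_read_options():
    """Reader options for Landing.

    With schema inference turned off, Spark's CSV reader returns every
    column as STRING, which matches the Landing rule above.
    """
    return {"header": "true", "inferSchema": "false"}


def load_to_landing(spark, source_path, table_name, database="edw_ld"):
    """Append raw file contents to a Landing table without transforming them."""
    # Assumption: the source files are CSV with a header row.
    raw_df = spark.read.options(**landing_read_options()).csv(source_path)
    # Append mode keeps previously ingested batches intact.
    raw_df.write.mode("append").saveAsTable(f"{database}.{table_name}")
    return raw_df
```

A call might look like `load_to_landing(spark, source_path, "orders")`, where `source_path` is the S3 location taken from configuration and `orders` is a hypothetical table name.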

### Layer 2: Staging (Silver)
*   **Database Name:** `edw_stg`
*   **Source:** Data from Landing Layer (`edw_ld`).
*   **Load Strategy:** **Overwrite/Truncate Mode** (for each batch).
*   **Data Transformation:**
    *   Apply Schema (Cast columns to Integer, Date, etc.).
    *   Perform **all major calculations** and business logic.
    *   Perform basic joins required for transformation.
*   **Goal:** To prepare clean, transformed data ready for the warehouse.
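
The Staging step can be sketched in the same way: read the STRING columns from Landing, cast them, and overwrite the Staging table. This is an illustrative sketch only; `build_cast_expressions`, `load_to_staging`, and the `{column: SQL type}` mapping are assumptions about how the schema might be supplied.

```python
def build_cast_expressions(type_map):
    """Turn {column: SQL type} into SELECT expressions that cast each column."""
    return [
        f"CAST(`{col}` AS {sql_type}) AS `{col}`"
        for col, sql_type in type_map.items()
    ]


def load_to_staging(spark, table_name, type_map,
                    source_db="edw_ld", target_db="edw_stg"):
    """Apply the schema to a Landing table and overwrite the Staging table."""
    landing_df = spark.table(f"{source_db}.{table_name}")
    # Only the columns named in type_map are kept, each cast to its target type.
    typed_df = landing_df.selectExpr(*build_cast_expressions(type_map))
    # Calculations, business logic, and joins would be applied here.
    # Overwrite mode replaces the previous batch in Staging.
    typed_df.write.mode("overwrite").saveAsTable(f"{target_db}.{table_name}")
    return typed_df
```

For example, `build_cast_expressions({"order_id": "INT"})` yields a single expression that casts the hypothetical `order_id` column to an integer.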

### Layer 3: Data Warehouse (Gold)
*   **Database Name:** `edw`
*   **Source:** Data from Staging Layer (`edw_stg`).
*   **Load Strategy:** **Upsert (Merge)** or Append.
    *   Handles Slowly Changing Dimensions (SCD Type 1 or 2).
*   **Data Transformation:**
    *   Minimal calculations.
    *   Joins to populate Surrogate Keys.
*   **Goal:** Optimized tables for Analytics and Reporting.
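
An SCD Type 1 upsert from Staging into the warehouse can be expressed as a `MERGE INTO` statement. The sketch below only builds the SQL text. It assumes the Gold tables use a table format that supports `MERGE INTO` (Delta Lake is one example; the source does not say which format is used), and it does not cover surrogate key generation or SCD Type 2 history. `build_scd1_merge_sql` and the table and column names in the usage note are hypothetical.

```python
def build_scd1_merge_sql(target, source, key_cols, update_cols):
    """Build a MERGE statement that overwrites changed attributes (SCD Type 1)."""
    # Rows match when all business key columns are equal.
    on_clause = " AND ".join(f"t.{k} = s.{k}" for k in key_cols)
    # Matched rows are updated in place, so no history is kept.
    set_clause = ", ".join(f"t.{c} = s.{c}" for c in update_cols)
    all_cols = list(key_cols) + list(update_cols)
    insert_cols = ", ".join(all_cols)
    insert_vals = ", ".join(f"s.{c}" for c in all_cols)
    return "\n".join([
        f"MERGE INTO {target} AS t",
        f"USING {source} AS s",
        f"ON {on_clause}",
        f"WHEN MATCHED THEN UPDATE SET {set_clause}",
        f"WHEN NOT MATCHED THEN INSERT ({insert_cols}) VALUES ({insert_vals})",
    ])
```

The resulting string could be passed to `spark.sql(...)`, for example with `build_scd1_merge_sql("edw.dim_customer", "edw_stg.customer", ["customer_id"], ["name", "city"])`.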

---

## 3. Job Control Mechanism
To manage this flow, we will use a **Job Control Table**. This table will:
*   Log the status of every pipeline run.
*   Manage **Incremental Loading** (store the last processed timestamp or file ID).
*   Allow the pipeline to determine if it needs to run a Full Load or Incremental Load.
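
The decision between a Full Load and an Incremental Load could look roughly like the sketch below. It models job control rows as plain Python objects rather than a Spark table; the `JobRun` fields, the status values, and `decide_load_type` are assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class JobRun:
    """One row of a hypothetical job control table."""
    job_name: str
    status: str  # assumed values: "SUCCESS" or "FAILED"
    last_processed_ts: Optional[datetime] = None


def decide_load_type(history, job_name):
    """Return ("FULL", None) if the job never succeeded,
    otherwise ("INCREMENTAL", latest successful watermark)."""
    successes = [
        run for run in history
        if run.job_name == job_name
        and run.status == "SUCCESS"
        and run.last_processed_ts is not None
    ]
    if not successes:
        return "FULL", None
    # Failed runs are ignored so their data is picked up again next time.
    return "INCREMENTAL", max(run.last_processed_ts for run in successes)
```

Ignoring failed runs when choosing the watermark is one design option: the next run then re-reads everything since the last successful load.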

---

In [None]:
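# Set up the Spark session
from pyspark.sql import SparkSession

def get_spark_session():
    """Create (or reuse) a local Spark session with Hive support enabled."""
    return (
        SparkSession.builder
        .appName("05_Architecture_Setup")
        .master("local[*]")      # run locally on all available cores
        .enableHiveSupport()     # persist databases and tables in the metastore
        .getOrCreate()
    )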

spark = get_spark_session()
print(f"Spark Version: {spark.version}")

## 4. Setting up the Data Lakehouse Layers

We will now create the three databases in our Spark environment that correspond to the layers defined in our strategy.

In [None]:
# Define Database Names based on strategy
databases = {
    "landing": "edw_ld",
    "staging": "edw_stg",
    "warehouse": "edw"
}

# Create Databases
def create_databases(spark, db_dict):
    for layer, db_name in db_dict.items():
        print(f"Creating {layer} layer database: {db_name}...")
        spark.sql(f"CREATE DATABASE IF NOT EXISTS {db_name}")
        print(f"Database {db_name} created (or already exists).")

create_databases(spark, databases)

In [None]:
# Verify Database Creation
print("Listing all databases in the Spark Catalog:")
spark.sql("SHOW DATABASES").show()
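
Besides inspecting the listing by eye, the check can be made programmatic. The helper below is a small sketch using `spark.catalog.listDatabases()`; the name `find_missing_databases` is hypothetical.

```python
def find_missing_databases(spark, expected_names):
    """Return the expected database names that are absent from the Spark catalog."""
    existing = {db.name for db in spark.catalog.listDatabases()}
    return sorted(set(expected_names) - existing)
```

In this notebook it could be called as `find_missing_databases(spark, databases.values())`, which should return an empty list once all three layers exist.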

## 5. Configuration Strategy

To support the loading strategy described above (reading every column as a string in Landing, then casting to proper types in Staging), we will define a configuration structure in future notebooks.

The flow for the upcoming implementation will be:
1.  **Read Config:** Load file paths and schema definitions.
2.  **Check Job Control:** Determine which files to read.
3.  **Run Landing Job:** Ingest raw data to `edw_ld`.
4.  **Run Staging Job:** Read `edw_ld`, transform, write to `edw_stg`.
5.  **Run Warehouse Job:** Merge `edw_stg` into `edw`.
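
The five steps above can be sketched as an ordered list of callables with a simple runner. This is only an outline under assumptions: the step functions are placeholders, and stopping at the first failure is one possible policy, not something the design prescribes. `run_pipeline` and the step names are hypothetical.

```python
def run_pipeline(steps):
    """Run (name, callable) steps in order and record a status for each.

    Execution stops at the first failing step so that later layers are
    never loaded from incomplete upstream data.
    """
    results = []
    for name, step in steps:
        try:
            step()
            results.append((name, "SUCCESS"))
        except Exception as exc:
            results.append((name, f"FAILED: {exc}"))
            break
    return results


# Placeholder steps; real implementations belong to the upcoming notebooks.
steps = [
    ("read_config", lambda: None),
    ("check_job_control", lambda: None),
    ("run_landing_job", lambda: None),
    ("run_staging_job", lambda: None),
    ("run_warehouse_job", lambda: None),
]
results = run_pipeline(steps)
```

The returned list of `(step, status)` pairs is the kind of information the Job Control Table would log for each run.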