# Medallion Architecture (Multi-Hop Architecture)

**Objective:** Understand the logical organization of data in a Lakehouse using the Medallion Architecture pattern.

## What is Medallion Architecture?
The **Medallion Architecture** is a data design pattern used to logically organize data in a Lakehouse. It is often referred to as a "Multi-Hop" architecture because data moves through different stages (hops) of validation and enrichment.

The goal is to incrementally and progressively improve the structure and quality of data as it flows through each layer of the architecture.

The three distinct layers are:
1.  **Bronze** (Raw)
2.  **Silver** (Validated/Enriched)
3.  **Gold** (Aggregated/Business-Level)

## 1. The Bronze Layer (Raw Data)

The **Bronze layer** is the landing zone for all data from external source systems.

*   **Input:** Streaming sources (Kafka, Event Hubs), Batch sources (DB dumps), or files (CSV, JSON, XML).
*   **Characteristics:**
    *   Stores data "as-is" (Raw format).
    *   Maintains the full history of the data (append-only is common).
    *   Includes metadata columns like load time, source filename, or process ID (Audit Columns).
*   **Why store it as-is?**
    *   It ensures that the original data is always available for reprocessing if logic changes in downstream layers.
    *   Skipping transformations at ingestion keeps writes to the Bronze Delta Lake tables fast, so the layer is optimized for throughput.
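
The Bronze characteristics above can be sketched in plain Python, used here as a stand-in so the idea runs without a Spark cluster. The function `to_bronze`, the column names (`raw_payload`, `_load_time`, `_source_file`, `_process_id`), and the file names are hypothetical and chosen only for illustration. The point is that the payload is kept untouched, even when it is malformed, and only audit columns are added.

```python
from datetime import datetime, timezone

def to_bronze(raw_lines, source_file, process_id):
    """Wrap each raw line with audit columns; the payload itself is not modified."""
    load_time = datetime.now(timezone.utc).isoformat()
    return [
        {
            "raw_payload": line,          # stored as-is, even if it is malformed
            "_load_time": load_time,      # audit column: when the row was loaded
            "_source_file": source_file,  # audit column: where the row came from
            "_process_id": process_id,    # audit column: which run loaded it
        }
        for line in raw_lines
    ]

# Append-only: batches are only ever added, never updated or deleted.
bronze_table = []
bronze_table.extend(
    to_bronze(['{"id": 1, "name": "ann"}', "not-json"], "orders_001.json", "run-1")
)
bronze_table.extend(
    to_bronze(['{"id": 1, "name": "ann"}'], "orders_002.json", "run-2")
)
```

Because nothing is parsed or rejected at this stage, the malformed `not-json` line and the repeated record both survive and can be handled later, or reprocessed if downstream logic changes.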

## 2. The Silver Layer (Cleansed & Conformed)

The **Silver layer** represents the "Enterprise View" of the data. It is created by reading from the Bronze layer and applying transformations.

*   **Process:**
    *   **Filtering:** Removing corrupt or irrelevant data.
    *   **Cleaning:** Handling null values, formatting dates, standardizing strings.
    *   **Deduplication:** Removing duplicate records.
    *   **Joins:** Joining multiple Bronze tables to create enriched datasets.
*   **Characteristics:**
    *   Data is strongly typed, and the tables enforce schema validation.
    *   It serves as a source for ad-hoc analysis by Data Engineers and Analysts who need granular data.
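
As a rough illustration, the four Silver steps can be written as one plain-Python pass over Bronze-style rows. Everything here is an assumption made for the example: the function `to_silver`, the field names, the `dd/mm/YYYY` input date format, and the `departments` lookup that stands in for a second Bronze table. In Spark the same steps would typically be DataFrame operations.

```python
from datetime import datetime

def to_silver(bronze_rows, departments):
    """Filter, clean, deduplicate, and enrich Bronze-style rows."""
    seen_ids = set()
    silver = []
    for row in bronze_rows:
        # Filtering: drop rows that are missing the key or the date.
        if row.get("id") is None or not row.get("order_date"):
            continue
        # Deduplication: keep only the first occurrence of each id.
        if row["id"] in seen_ids:
            continue
        seen_ids.add(row["id"])
        silver.append({
            "id": int(row["id"]),  # strong typing
            # Cleaning: replace nulls and standardize strings.
            "name": (row.get("name") or "UNKNOWN").strip().upper(),
            # Cleaning: parse the date into a real date value.
            "order_date": datetime.strptime(row["order_date"], "%d/%m/%Y").date(),
            # Join: enrich with the department name from a lookup table.
            "department": departments.get(row.get("dept_id"), "UNASSIGNED"),
        })
    return silver

bronze_rows = [
    {"id": "1", "name": " ann ", "order_date": "05/01/2024", "dept_id": "D1"},
    {"id": "1", "name": " ann ", "order_date": "05/01/2024", "dept_id": "D1"},  # duplicate
    {"id": None, "name": "bob", "order_date": "06/01/2024", "dept_id": "D2"},   # corrupt
    {"id": "2", "name": None, "order_date": "07/01/2024", "dept_id": "D9"},
]
departments = {"D1": "Sales", "D2": "Support"}
silver_rows = to_silver(bronze_rows, departments)
```

Of the four input rows, two reach Silver: the duplicate and the row without an `id` are dropped, and the row with an unknown department is kept but labeled `UNASSIGNED`.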

## 3. The Gold Layer (Curated Business Data)

The **Gold layer** is organized for consumption by business users, reporting tools, and ML models.

*   **Process:**
    *   Reading data from the Silver layer.
    *   Applying complex business rules and aggregations (Sums, Averages, KPIs).
*   **Characteristics:**
    *   Highly refined and generally de-normalized (for example, modeled as star schemas) for read performance.
    *   Contains ready-to-use metrics.
    *   Data volume is typically much smaller than Bronze or Silver due to aggregation.
*   **Consumers:** Power BI, Tableau, Data Scientists, and Management Reporting.
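
A minimal sketch of the Gold step, again in plain Python as a stand-in for Spark: granular Silver rows are rolled up into one row per department with ready-to-use metrics. The function `to_gold` and the metric names (`total_sales`, `order_count`, `avg_sale`) are hypothetical.

```python
from collections import defaultdict

def to_gold(silver_rows):
    """Aggregate granular Silver rows into one KPI row per department."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in silver_rows:
        totals[row["department"]] += row["sales"]
        counts[row["department"]] += 1
    return [
        {
            "department": dept,
            "total_sales": totals[dept],
            "order_count": counts[dept],
            "avg_sale": totals[dept] / counts[dept],
        }
        for dept in sorted(totals)
    ]

silver_rows = [
    {"id": 1, "department": "Sales", "sales": 100.0},
    {"id": 2, "department": "Sales", "sales": 50.0},
    {"id": 3, "department": "Support", "sales": 30.0},
]
gold_rows = to_gold(silver_rows)
```

Three Silver rows become two Gold rows, which is the volume reduction described above, and a reporting tool can read the metrics directly without recomputing them.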

## Data Quality & Governance

Data Quality and Governance are not restricted to a single layer; they apply across the entire pipeline.
*   **Bronze:** Schema enforcement on write.
*   **Silver:** Constraint checks (e.g., `CHECK (age > 0)`), null checks.
*   **Gold:** Business logic validation.
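
To make the per-layer checks concrete, here is a small plain-Python sketch with one check per layer. The function names, the `headcount` metric, and the reconciliation rule used for Gold are assumptions made for the example. In a Delta Lake pipeline these checks would normally be expressed as table schemas, constraints, or pipeline expectations rather than hand-written functions.

```python
def check_bronze(row, expected_columns):
    """Schema check: the row must carry exactly the expected columns."""
    return set(row) == set(expected_columns)

def check_silver(row):
    """Constraint checks: name is not null and age > 0, like CHECK (age > 0)."""
    return (
        row.get("name") is not None
        and row.get("age") is not None
        and row["age"] > 0
    )

def check_gold(report_rows, silver_rows):
    """Business validation: the reported headcount must reconcile with Silver."""
    return sum(r["headcount"] for r in report_rows) == len(silver_rows)

incoming = [{"name": "ann", "age": 34}, {"name": "bob", "age": -1}]
schema_ok = all(check_bronze(r, ["name", "age"]) for r in incoming)
valid_silver = [r for r in incoming if check_silver(r)]
gold_report = [{"department": "Sales", "headcount": 1}]
gold_ok = check_gold(gold_report, valid_silver)
```

Both rows pass the Bronze schema check, but only one passes the Silver constraint, and the Gold check confirms that the report accounts for exactly the rows that reached Silver.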

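```python
# Conceptual representation of the flow in PySpark.
# Note: this is an illustrative sketch of the "Multi-Hop" logic. Paths, table
# names, and the source file format are placeholders, and `spark` is the
# SparkSession that a Databricks notebook provides.
from pyspark.sql.functions import col, current_timestamp, upper
from pyspark.sql.functions import sum as spark_sum  # avoid shadowing Python's built-in sum

# --- HOP 1: Ingest to BRONZE ---
# Read the raw stream with Auto Loader and append it to a Bronze Delta table.
raw_df = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")  # assumed format; Auto Loader needs this option
    .option("cloudFiles.schemaLocation", "/mnt/delta/bronze/_schema")  # assumed path, used for schema inference
    .load("/mnt/source/incoming_files")
)
(
    raw_df.withColumn("ingest_timestamp", current_timestamp())  # audit column
    .writeStream
    .format("delta")
    .option("checkpointLocation", "/mnt/delta/bronze/_checkpoints")
    .toTable("bronze_table")  # starts the streaming query
)

# --- HOP 2: Cleanse to SILVER ---
# Read from Bronze, clean, and write to Silver.
bronze_df = spark.readStream.table("bronze_table")
silver_df = (
    bronze_df.filter("status = 'ACTIVE'")
    # Without a watermark, streaming deduplication keeps state for every id seen.
    .dropDuplicates(["id"])
    .withColumn("clean_name", upper(col("name")))
)
(
    silver_df.writeStream
    .format("delta")
    .option("checkpointLocation", "/mnt/delta/silver/_checkpoints")
    .toTable("silver_table")
)

# --- HOP 3: Aggregate to GOLD ---
# Read from Silver, aggregate, and write to Gold.
silver_static = spark.read.table("silver_table")  # often batch for reporting
gold_df = silver_static.groupBy("department").agg(
    spark_sum("sales").alias("total_sales")
)

gold_df.write.format("delta").mode("overwrite").saveAsTable("gold_sales_report")
```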

## What's Next?
Now that we understand the **Medallion Architecture**, the next video puts it into practice using **Delta Live Tables (DLT)**. We will see how to automate the flow of data from Bronze to Silver to Gold.