# Introducing Medallion Architecture
Medallion Architecture, also known as multi-hop architecture, is a data design pattern that organizes data into progressive layers. Its main objective is to incrementally improve the structure, quality, and usability of data as it moves through each layer.

### The Layered Approach
The architecture is built around three primary layers, each with a specific role in the data refinement process:

- **Bronze Layer**
- **Silver Layer**
- **Gold Layer**

These layers are named to reflect their increasing levels of data quality and business value.

### Bronze Layer
The bronze layer is the foundation of the architecture. Here, data is ingested in its rawest form directly from source systems and stored in bronze tables as received. The payload itself is not transformed, although ingestion metadata such as the source file name and load time is often recorded alongside it.

**Purpose and Benefits:**

- Preserves the original data for traceability and auditing
- Acts as a single source of truth
- Supports a wide range of sources (structured files, operational databases, streaming platforms like Kafka)
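
As a rough illustration of the bronze idea, the plain-Python sketch below wraps each raw line with ingestion metadata and leaves the payload untouched, so duplicates and bad values survive for later auditing. The helper `to_bronze`, the sample lines, and the file name `sales_demo.json` are hypothetical and not part of this notebook.

```python
from datetime import datetime, timezone


def to_bronze(raw_lines, source_name):
    """Attach ingestion metadata to each raw line without changing the payload."""
    loaded_at = datetime.now(timezone.utc).isoformat()
    return [
        {"raw": line, "source": source_name, "load_time": loaded_at}
        for line in raw_lines
    ]


# Hypothetical raw input: a duplicate and a malformed record are kept on purpose.
raw_lines = [
    '{"order_id": "a1", "customer_id": 1, "quantity": 2}',
    '{"order_id": "a1", "customer_id": 1, "quantity": 2}',
    '{"order_id": "b2", "customer_id": null, "quantity": "3"}',
]

bronze_rows = to_bronze(raw_lines, "sales_demo.json")
print(len(bronze_rows), "rows stored as received")
```

Keeping the payload byte-for-byte identical is what makes it possible to rebuild the later layers from bronze if the cleaning rules change.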

### Silver Layer
Data is cleansed, standardized, and enriched as it moves into the silver layer.

**Key activities:**

- Removing duplicates and errors
- Normalizing inconsistent formats
- Validating data quality
- Joining datasets to create a more integrated view

The result is reliable, consistent data that is ready for analysis and for consumption by downstream systems.
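
The key activities listed above can be sketched in plain Python on a handful of records. This is a minimal sketch under assumed rules (a row needs an `order_id` and a positive integer quantity, and the first occurrence of an `order_id` wins); the function `clean_orders` and the sample rows are hypothetical, and a real pipeline would choose its own validation rules.

```python
from datetime import datetime


def clean_orders(bronze_rows):
    """Return validated, normalized, de-duplicated orders (first occurrence wins)."""
    seen_ids = set()
    silver_rows = []
    for row in bronze_rows:
        order_id = row.get("order_id")
        # Validate: quantity must be convertible to a positive integer.
        try:
            quantity = int(row.get("quantity"))
        except (TypeError, ValueError):
            continue
        if not order_id or quantity <= 0:
            continue
        # De-duplicate on the business key.
        if order_id in seen_ids:
            continue
        seen_ids.add(order_id)
        # Normalize: parse the timestamp string into a date.
        order_date = datetime.strptime(
            row["order_timestamp"], "%Y-%m-%d %H:%M:%S"
        ).date()
        silver_rows.append(
            {"order_id": order_id, "quantity": quantity, "order_date": order_date}
        )
    return silver_rows


# Hypothetical bronze rows: one duplicate, one string quantity, one missing quantity.
sample_rows = [
    {"order_id": "a1", "quantity": 2, "order_timestamp": "2024-01-15 09:30:00"},
    {"order_id": "a1", "quantity": 2, "order_timestamp": "2024-01-15 09:30:00"},
    {"order_id": "b2", "quantity": "3", "order_timestamp": "2024-01-15 10:05:00"},
    {"order_id": "c3", "quantity": None, "order_timestamp": "2024-01-15 11:00:00"},
]

silver_rows = clean_orders(sample_rows)
for row in silver_rows:
    print(row)
```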

### Gold Layer
In the gold layer, data is transformed into its most refined, business-ready form.

**Characteristics:**

- Aggregation and summarization tailored to business needs
- Creation of curated datasets for reporting, dashboards, and advanced analytics
- Often used as the foundation for machine learning and AI models
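
To make the aggregation step concrete, here is a small plain-Python sketch that sums quantities per region and order date, the same grouping used later in this notebook. The function `summarize` and the sample rows are hypothetical.

```python
from collections import defaultdict


def summarize(silver_rows):
    """Sum quantity per (region, order_date) key."""
    totals = defaultdict(int)
    for row in silver_rows:
        totals[(row["region"], row["order_date"])] += row["quantity"]
    return dict(totals)


# Hypothetical silver rows, already cleaned and enriched with a region.
sample_silver = [
    {"region": "North", "order_date": "2024-01-15", "quantity": 2},
    {"region": "North", "order_date": "2024-01-15", "quantity": 3},
    {"region": "South", "order_date": "2024-01-15", "quantity": 1},
]

gold_totals = summarize(sample_silver)
for (region, order_date), total in sorted(gold_totals.items()):
    print(region, order_date, total)
```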

In [0]:
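# Import only the Spark functions this notebook uses.
# Note: this `sum` is the Spark aggregate and shadows Python's built-in sum().
from pyspark.sql.functions import current_timestamp, input_file_name, sum, to_date
import uuid
import json
import random
from datetime import datetime, timedelta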

# Table names
bronze_table = "bronze_sales_tbl"
silver_table = "silver_sales_tbl"
gold_table   = "gold_sales_tbl"

# Paths for Auto Loader input and checkpoints
sales_path = "/sales_order"
bronze_checkpoint = "/checkpoint/bronze"
silver_checkpoint = "/checkpoint/silver"
gold_checkpoint   = "/checkpoint/gold"

In [0]:
spark.sql("USE CATALOG hive_metastore")
spark.sql("DROP DATABASE IF EXISTS medallion CASCADE")
spark.sql("CREATE DATABASE medallion")
spark.sql("USE medallion")

In [0]:
spark.sql("SHOW TABLES").show()

In [0]:
# Generate static customer table
customer_data = [
    (1, "Alice", "North"),
    (2, "Bob", "South"),
    (3, "Charlie", "East"),
    (4, "Diana", "West"),
    (5, "Eve", "North"),
    (6, "Frank", "South"),
    (7, "Grace", "East"),
    (8, "Hank", "West"),
    (9, "Ivy", "North"),
    (10, "Jack", "South"),
    (11, "Kathy", "East"),
    (12, "Leo", "West"),
    (13, "Mona", "North"),
    (14, "Nina", "South"),
    (15, "Oscar", "East")
]
customer_df = spark.createDataFrame(customer_data, ["customer_id", "customer_name", "region"])
customer_df.write.mode("overwrite").format("delta").saveAsTable("customers")

# Generate static product table
product_data = [
    (101, "Laptop", 800),
    (102, "Mouse", 25),
    (103, "Keyboard", 45),
    (104, "Monitor", 180),
    (105, "Headphones", 60),
    (106, "Webcam", 70),
    (107, "Printer", 150),
    (108, "Tablet", 300),
    (109, "Smartphone", 600),
    (110, "Speakers", 120),
    (111, "Router", 90),
    (112, "External Hard Drive", 100)
]
product_df = spark.createDataFrame(product_data, ["product_id", "product_name", "unit_price"])
product_df.write.mode("overwrite").format("delta").saveAsTable("products")

### Sales Data Generator and Stream Helper Functions

In [0]:
# Ensure the directory exists
dbutils.fs.mkdirs(sales_path)

def generate_sales_data(num_records=5):
    sales = []
    for _ in range(num_records):
        sales.append({
            "order_id": str(uuid.uuid4()),
            "customer_id": random.choice([1, 2, 3, 4, 5]),
            "product_id": random.choice([101, 102, 103, 104, 105]),
            "quantity": random.randint(1, 5),
            "order_timestamp": (datetime.now() - timedelta(minutes=random.randint(1, 60))).strftime("%Y-%m-%d %H:%M:%S")
        })

    file_name = f"/dbfs{sales_path}/sales_{uuid.uuid4()}.json"
    
    with open(file_name, "w") as f:
        for record in sales:
            f.write(json.dumps(record) + "\n")  # Write as newline-delimited JSON (NDJSON)

    print(f"{num_records} sales records written to {file_name}")

def listStream():
    for q in spark.streams.active:
        print("Id: ", q.id, "Streaming: ", q.isActive)

def stopStream():
    for q in spark.streams.active:
        q.stop()
        q.awaitTermination()

def cleanup():
    stopStream()
    dbutils.fs.rm("/checkpoint", True)
    dbutils.fs.rm("/schema", True)
    dbutils.fs.rm(sales_path, True)
    spark.sql("DROP DATABASE IF EXISTS medallion CASCADE")

In [0]:
generate_sales_data(10)
display(dbutils.fs.ls(sales_path))
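
The generator writes newline-delimited JSON, where each line is one complete JSON object. The stdlib-only sketch below shows the round trip on a temporary file, as a way to see the format outside Databricks. The helpers `write_ndjson` and `read_ndjson` and the sample records are hypothetical.

```python
import json
import os
import tempfile


def write_ndjson(records, path):
    """Write one JSON object per line."""
    with open(path, "w") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")


def read_ndjson(path):
    """Parse each non-empty line as an independent JSON object."""
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]


records = [
    {"order_id": "a1", "quantity": 2},
    {"order_id": "b2", "quantity": 5},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sales_demo.json")
    write_ndjson(records, path)
    loaded = read_ndjson(path)

print(loaded == records)
```

Because every line stands alone, a reader can process the file line by line without loading the whole file into memory.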

### Auto Loader – Stream Raw Files to Bronze

In [0]:
bronze_stream = (
    spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", f"/schema/{bronze_table}")
    .load(sales_path)
    .withColumn("input_file", input_file_name())
    .withColumn("load_time", current_timestamp())
)

# Write to managed Bronze table
(bronze_stream.writeStream
 .outputMode("append")
 .option("checkpointLocation", bronze_checkpoint)
 .toTable(bronze_table))


In [0]:
spark.table(bronze_table).display()

In [0]:
generate_sales_data(5)  # Add more records

In [0]:
spark.table(bronze_table).display()

### Stream Bronze to Silver with Lookup Join

In [0]:
# Load static customer lookup
customer_lkp_df = spark.table("customers")

# Read Bronze as stream
bronze_read_stream = spark.readStream.table(bronze_table)

# Enrich and transform
silver_stream = (
    bronze_read_stream
    .join(customer_lkp_df, on="customer_id", how="left")
    .withColumn("order_date", to_date("order_timestamp"))
    .select("order_id", "customer_id", "region", "product_id", "quantity", "order_date", "load_time")
)

# Write to managed Silver table
(silver_stream.writeStream
 .outputMode("append")
 .option("checkpointLocation", silver_checkpoint)
 .toTable(silver_table))


In [0]:
spark.table(silver_table).display()

In [0]:
generate_sales_data(5)  # Add more records

In [0]:
spark.table(silver_table).display()
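
The lookup join in this section is a left join, so every order is kept even when no matching customer exists. The plain-Python sketch below illustrates that behavior with a dictionary lookup; the function `left_join_region`, the sample data, and the unknown `customer_id` of 99 are hypothetical.

```python
def left_join_region(orders, customers_by_id):
    """Add a region to each order; unmatched orders are kept with region None."""
    enriched = []
    for order in orders:
        customer = customers_by_id.get(order["customer_id"])
        region = customer["region"] if customer else None
        enriched.append({**order, "region": region})
    return enriched


customers_by_id = {
    1: {"customer_name": "Alice", "region": "North"},
    2: {"customer_name": "Bob", "region": "South"},
}

orders = [
    {"order_id": "a1", "customer_id": 1, "quantity": 2},
    {"order_id": "b2", "customer_id": 99, "quantity": 1},  # no matching customer
]

enriched_orders = left_join_region(orders, customers_by_id)
for row in enriched_orders:
    print(row)
```

An inner join would silently drop the second order, which is why a left join is a common choice when the lookup table may lag behind the incoming data.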

### Stream Silver to Gold with Aggregation

In [0]:
# Read Silver as stream
silver_read_stream = spark.readStream.table(silver_table)

# Aggregate
gold_stream = (
    silver_read_stream
    .groupBy("region", "order_date")
    .agg(sum("quantity").alias("total_quantity"))
)

# Write to managed Gold table
(gold_stream.writeStream
 .outputMode("complete")  # Rewrites the full result each trigger; append mode is not allowed for an aggregation without a watermark
 .option("checkpointLocation", gold_checkpoint)
 .toTable(gold_table))


In [0]:
spark.sql("SELECT * FROM gold_sales_tbl ORDER BY order_date, region;").display()
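
The effect of complete output mode can be pictured as a running aggregate that is re-emitted in full after every micro-batch. The plain-Python sketch below simulates that idea and is not how Spark implements it; the class `RunningTotals` and the sample batches are hypothetical.

```python
from collections import defaultdict


class RunningTotals:
    """Keep aggregate state across batches and emit the whole result each time."""

    def __init__(self):
        self._state = defaultdict(int)

    def process_batch(self, batch):
        for row in batch:
            self._state[(row["region"], row["order_date"])] += row["quantity"]
        # Complete mode: the full result replaces the previous table contents.
        return dict(self._state)


totals = RunningTotals()

batch_1 = [
    {"region": "North", "order_date": "2024-01-15", "quantity": 2},
    {"region": "South", "order_date": "2024-01-15", "quantity": 1},
]
batch_2 = [
    {"region": "North", "order_date": "2024-01-15", "quantity": 3},
]

result_after_1 = totals.process_batch(batch_1)
result_after_2 = totals.process_batch(batch_2)
print(result_after_1)
print(result_after_2)
```

Note that the South row still appears in the second result even though the second batch contained no South orders, because the whole aggregate is emitted each time.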

### Add More Sales Records

In [0]:
generate_sales_data(30)  # Simulate new incoming data

In [0]:
spark.sql("SELECT * FROM gold_sales_tbl ORDER BY order_date, region;").display()

In [0]:
listStream()

### Clean Up

In [0]:
cleanup()