This notebook demonstrates a standard Medallion Architecture pattern, specifically focusing on the transition from the Bronze layer (raw data) to the Silver layer (cleaned, validated, and deduplicated data).

**Notebook Purpose:** This notebook refines raw menu item data into a high-quality "Silver" table. It applies data quality checks, schema enforcement, and deduplication so that the data is ready for analytics.

##silver_menuitems Table Creation

##1. Environment Setup & Configuration

We define the names of our Delta tables and schemas using the Unity Catalog three-level naming standard (catalog.schema.table).

In [0]:
import pandas as pd
from pyspark.sql import functions as F
from pyspark.sql.window import Window
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import StringType, StructType, StructField, LongType, DoubleType, BooleanType, TimestampType


# --- CONFIGURATION ---
# Using backticks for catalog names containing hyphens to comply with Spark SQL syntax
CATALOG = "`vstone-catalog`"
SILVER_SCHEMA = "silver_schema"
BRONZE_TABLE = f"{CATALOG}.bronze_schema.bronze_menuitems"
SILVER_TABLE = f"{CATALOG}.{SILVER_SCHEMA}.silver_menuitems"
QUARANTINE_TABLE = f"{CATALOG}.{SILVER_SCHEMA}.quarantine_menuitems"
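
As an alternative to hard-coding backticks in the catalog name, the three-level name can be assembled by a small helper. This is only a sketch: `quote_identifier` and `qualified_name` are hypothetical names that are not part of this notebook, and the helper assumes Spark SQL's rule that a backtick inside a quoted identifier is escaped by doubling it.

```python
def quote_identifier(name: str) -> str:
    """Wrap a name in backticks, doubling any backtick it already contains."""
    return "`" + name.replace("`", "``") + "`"


def qualified_name(catalog: str, schema: str, table: str) -> str:
    """Build a catalog.schema.table name with every part quoted (hypothetical helper)."""
    return ".".join(quote_identifier(part) for part in (catalog, schema, table))


# Hyphenated catalog names are safe because every part is quoted
print(qualified_name("vstone-catalog", "silver_schema", "silver_menuitems"))
```

Quoting every part is harmless for plain names and avoids having to remember which names contain hyphens.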


In [0]:
# Bootstrap the environment by ensuring the destination schema exists

spark.sql(f"CREATE SCHEMA IF NOT EXISTS {CATALOG}.{SILVER_SCHEMA}")

##2. Table Definition & Schema Enforcement

Explicitly defining the Silver table schema ensures data integrity. We use the Delta Lake format to support ACID transactions and time travel.

In [0]:
# --- 2. EXPLICIT TABLE CREATION ---
# Ensure the table is created with backticks in the identifier
spark.sql(f"""
CREATE TABLE IF NOT EXISTS {SILVER_TABLE} (
  category STRING,
  is_seasonal BOOLEAN,
  item_id BIGINT,
  item_name STRING,
  price DOUBLE,
  load_dt TIMESTAMP,
  source STRING
) USING DELTA
""")

print(f"Schema and Table {SILVER_TABLE} verified successfully.")

##3. Data Ingestion & Header Standardization

Raw data often comes with inconsistent naming (spaces, special characters, mixed case). This step standardizes headers to snake_case.

In [0]:

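# --- PANDAS UDF FOR HEADER STANDARDIZATION ---
# Logic: convert to lowercase, replace each non-alphanumeric character with an underscore,
# then strip leading and trailing underscores.
# Note: a pandas UDF operates on column values, so it expects a column of header-name strings.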
@pandas_udf(StringType())
def standardize_header_udf(col_name: pd.Series) -> pd.Series:
    return col_name.str.lower().str.replace(r'[^a-zA-Z0-9]', '_', regex=True).str.strip('_')
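
The same rule can be tried on plain Python strings, without a Spark session. This is a minimal sketch: `standardize_header` and the sample headers are illustrative, not taken from the Bronze table. Unlike the UDF above, it collapses a run of special characters into a single underscore, which is an assumption about the desired output.

```python
import re


def standardize_header(name: str) -> str:
    """Convert one raw header to snake_case (illustrative helper)."""
    lowered = name.strip().lower()
    # Replace each run of non-alphanumeric characters with one underscore
    replaced = re.sub(r"[^a-z0-9]+", "_", lowered)
    # Drop underscores left at either end, e.g. from a trailing ")"
    return replaced.strip("_")


# Made-up headers covering spaces, hyphens, symbols, and padding
raw_headers = ["Item ID", "Item-Name", "Price ($)", " Is Seasonal "]
print([standardize_header(h) for h in raw_headers])
```

Two different raw headers can map to the same standardized name (for example "Item ID" and "item-id"), so checking the result for duplicates before renaming is worth considering.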



In [0]:
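import re

# Load raw data from Bronze
try:
    df_bronze = spark.read.table(BRONZE_TABLE)
except Exception as e:
    # Chain the original error so the full traceback is preserved
    raise RuntimeError(f"Could not read Bronze table {BRONZE_TABLE}: {e}") from e

# Apply standardization to snake_case: lowercase, replace non-alphanumeric characters
# with underscores, strip leading/trailing underscores (same rule as the UDF above)
standardized_cols = [re.sub(r"[^a-z0-9]", "_", col.lower()).strip("_") for col in df_bronze.columns]
df_standardized = df_bronze.toDF(*standardized_cols)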


##4. Data Quality & Quarantine Logic

In production, we never "drop" bad data. Instead, we quarantine it: failed records are written to a separate table so that data engineers can investigate the source of the error without stopping the entire pipeline.

In [0]:

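# --- QUARANTINE LOGIC (Malformed Records) ---
# Validation Rules:
# 1. Primary Key (item_id) must exist.
# 2. Business Logic: Price must be a positive value.
# A NULL price makes (price > 0) evaluate to NULL, and filter() drops rows whose predicate
# is NULL. Without the coalesce, such rows would land in neither Silver nor Quarantine.
valid_mask = F.col("item_id").isNotNull() & F.coalesce(F.col("price") > 0, F.lit(False))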

# Filter out failed records and tag them with a reason for easier debugging
df_quarantine = df_standardized.filter(~valid_mask) \
    .withColumn("quarantine_reason", 
        F.when(F.col("item_id").isNull(), "MISSING_ID")
         .otherwise("INVALID_PRICE")) \
    .withColumn("quarantined_at", F.current_timestamp())
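
In Spark SQL, a comparison such as `price > 0` yields NULL when `price` is NULL, and `filter` keeps only rows whose predicate is true. A row with a missing price therefore needs explicit handling if it is to reach quarantine. The intended routing can be sketched in plain Python; `classify` and the sample records are illustrative, not part of the notebook.

```python
from typing import Optional


def classify(item_id: Optional[int], price: Optional[float]) -> str:
    """Return the quarantine reason for a record, or "VALID" (illustrative helper)."""
    if item_id is None:
        return "MISSING_ID"
    # A missing price is treated as invalid rather than silently skipped
    if price is None or price <= 0:
        return "INVALID_PRICE"
    return "VALID"


# Made-up records covering each branch
records = [
    {"item_id": 1, "price": 4.50},
    {"item_id": None, "price": 3.00},
    {"item_id": 2, "price": -1.00},
    {"item_id": 3, "price": None},
]
for record in records:
    print(record, "->", classify(record["item_id"], record["price"]))
```

Every record gets exactly one label, so the valid and quarantined sets together always account for all input rows.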



##5. Deduplication & Transformation

The Silver layer must represent the "latest version of truth." We use Window functions to handle duplicate records for the same item_id.

In [0]:
# --- CLEAN & DEDUPE ---
# Cast load_dt to a timestamp first so the window orders by real timestamps, not strings,
# then partition by ID and order by timestamp to find the most recent record
window_spec = Window.partitionBy("item_id").orderBy(F.col("load_dt").desc())

df_silver_final = df_standardized.filter(valid_mask) \
    .withColumn("load_dt", F.to_timestamp(F.col("load_dt"))) \
    .withColumn("row_rank", F.row_number().over(window_spec)) \
    .filter("row_rank == 1") \
    .drop("row_rank") \
    .withColumn("price", F.round(F.col("price").cast("double"), 2)) \
    .select("category", "is_seasonal", "item_id", "item_name", "price", "load_dt", "source")
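
The keep-the-latest rule can be checked on a few rows in plain Python. This is a sketch under stated assumptions: `latest_per_item` and the sample rows (including their dates and prices) are made up for illustration.

```python
from datetime import datetime


def latest_per_item(rows):
    """Keep, for each item_id, the row with the greatest load_dt (illustrative helper)."""
    latest = {}
    for row in rows:
        current = latest.get(row["item_id"])
        # Replace the stored row only when this one is strictly newer
        if current is None or row["load_dt"] > current["load_dt"]:
            latest[row["item_id"]] = row
    return list(latest.values())


# Made-up rows: item 1 arrives twice with different load timestamps
rows = [
    {"item_id": 1, "item_name": "Coffee", "price": 3.00, "load_dt": datetime(2024, 1, 1)},
    {"item_id": 1, "item_name": "Coffee", "price": 3.25, "load_dt": datetime(2024, 1, 2)},
    {"item_id": 2, "item_name": "Tea", "price": 2.50, "load_dt": datetime(2024, 1, 1)},
]
for row in latest_per_item(rows):
    print(row["item_id"], row["item_name"], row["price"])
```

When two rows for the same item share a load_dt, this sketch keeps the first one it sees. `row_number()` gives no such guarantee for ties, so adding a second ordering column is worth considering if ties can occur.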



##6. Atomic Writes & Expectations
Finally, we commit the data to the Delta tables and apply hard constraints to prevent future "garbage" data from entering via other processes.

In [0]:
# --- ATOMIC WRITES ---
# Append failed records to Quarantine to keep an audit trail
df_quarantine.write.format("delta").mode("append").option("mergeSchema", "true").saveAsTable(QUARANTINE_TABLE)
# Overwrite the Silver table with the new validated, deduplicated set
df_silver_final.write.format("delta").mode("overwrite").option("overwriteSchema", "true").saveAsTable(SILVER_TABLE)



In [0]:
# --- APPLY CONSTRAINTS ---
# Delta Constraints: These act as a firewall at the storage level

spark.sql(f"ALTER TABLE {SILVER_TABLE} CHANGE COLUMN item_id SET NOT NULL")
try:
    spark.sql(f"ALTER TABLE {SILVER_TABLE} ADD CONSTRAINT positive_price CHECK (price > 0)")
except Exception as e:
    # Expected on re-runs when the constraint already exists; log it instead of hiding it
    print(f"Skipped adding positive_price constraint: {e}")

print(f"Successfully processed and updated: {SILVER_TABLE}")

##Industry Context & Logic

**1. Why the Medallion Architecture?**

In modern data engineering (Databricks/Spark), we use the Medallion Architecture:

- Bronze: Raw, unvalidated data.

- Silver (This Notebook): Cleaned, filtered, and deduplicated. It is the "Source of Truth" for Data Scientists.

- Gold: Aggregated data for Business Intelligence (BI) dashboards.

**2. Why Quarantine instead of Deleting?**

If you simply delete records where price <= 0, you lose visibility into upstream bugs. Writing them to a quarantine_menuitems table instead gives engineers a basis for a data quality dashboard, where they can see, for example: "Yesterday, 5% of our items had no IDs; we need to check the POS system source."
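
A data quality summary of this kind boils down to counting quarantine reasons against the total number of processed rows. A minimal plain-Python sketch follows; `reason_shares` and the sample counts are made up for illustration.

```python
from collections import Counter


def reason_shares(reasons, total_rows):
    """Return each quarantine reason as a percentage of all processed rows."""
    if total_rows <= 0:
        raise ValueError("total_rows must be positive")
    counts = Counter(reasons)
    return {reason: round(100 * n / total_rows, 1) for reason, n in counts.items()}


# Made-up batch: 100 rows processed, 7 of them quarantined
quarantined_reasons = ["MISSING_ID"] * 5 + ["INVALID_PRICE"] * 2
print(reason_shares(quarantined_reasons, total_rows=100))
```

On the real table, the same figures would come from grouping by quarantine_reason and the date part of quarantined_at.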

**3. Why Use Delta Constraints?**
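Reading raw files with Spark is typically "schema-on-read": nothing validates the data at the moment it is written.
A Delta table already enforces its column names and types on write. SET NOT NULL and ALTER TABLE ... ADD CONSTRAINT extend that enforcement from column types to column values.

This mimics traditional SQL Server/Oracle behavior: a write that violates a constraint fails, so no future notebook can accidentally write a negative price into our clean Silver table.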
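
The effect of a write-time constraint can be illustrated with a toy in-memory analogy. This is not how Delta implements constraints; `ConstrainedTable` and `ConstraintViolation` are hypothetical names used only to show the all-or-nothing behavior of a rejected write.

```python
class ConstraintViolation(Exception):
    """Raised when a row fails one of the table's checks (toy analogy)."""


class ConstrainedTable:
    """Toy in-memory table that validates every row before accepting a batch."""

    def __init__(self, checks):
        self.checks = checks  # mapping of constraint name -> predicate(row)
        self.rows = []

    def append(self, new_rows):
        # Validate the whole batch first so a bad row rejects the entire write
        for row in new_rows:
            for name, predicate in self.checks.items():
                if not predicate(row):
                    raise ConstraintViolation(f"{name} failed for {row}")
        self.rows.extend(new_rows)


silver = ConstrainedTable({
    "item_id_not_null": lambda r: r["item_id"] is not None,
    "positive_price": lambda r: r["price"] is not None and r["price"] > 0,
})

silver.append([{"item_id": 1, "price": 3.25}])
try:
    silver.append([{"item_id": 2, "price": 2.50}, {"item_id": 3, "price": -1.00}])
except ConstraintViolation as err:
    print("Write rejected:", err)
print(len(silver.rows))
```

The second batch is rejected as a whole, including its valid row, which mirrors the idea that a failed write leaves the table unchanged.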

**4. Deduplication Logic**

Window.partitionBy("item_id").orderBy(F.col("load_dt").desc()) combined with row_number() is a common way to handle late-arriving data and duplicates: it ranks the records for each item_id from newest to oldest, and we keep only rank 1.

If a record for "Coffee" is sent twice, this logic ensures we only keep the one with the most recent load_dt.