##Silver Layer: Stores Data Transformation
**Notebook Objective:** 

This notebook implements a robust ETL pipeline to transform raw "Store" data from the Bronze layer into a validated, deduplicated, and geographically accurate Silver table. 

It ensures that only stores with valid IDs and realistic GPS coordinates are made available for downstream analytics.

##1. Initial Data Profiling & Geospatial Validation

In this preliminary step, we perform a "health check" on the raw data. Since store data is highly dependent on location, we verify that latitude and longitude values fall within their valid ranges (latitude -90 to 90, longitude -180 to 180) and that no store is missing its identifier.

In [0]:
%sql
-- 1. Check for coordinate validity (lat: -90 to 90, long: -180 to 180)
-- This helps identify 'impossible' locations before they reach the Silver layer.
SELECT 
  count(*) FILTER (WHERE latitude < -90 OR latitude > 90) as invalid_lat,
  count(*) FILTER (WHERE longitude < -180 OR longitude > 180) as invalid_long
FROM `vstone-catalog`.bronze_schema.bronze_stores;

-- 2. Check for missing store identifiers
-- Primary keys are non-negotiable for the Silver layer.
SELECT count(*) FROM `vstone-catalog`.bronze_schema.bronze_stores WHERE store_id IS NULL;
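
The same health check can be sketched in plain Python to make the NULL behavior explicit. This is an illustration only: `profile_coordinates` and `sample_rows` are hypothetical names, not part of the notebook. As in the SQL above, a missing coordinate is not counted as out of range, so it is worth counting separately.

```python
from typing import Optional


def profile_coordinates(rows: list[dict]) -> dict:
    """Count out-of-range and missing values, mirroring the SQL FILTER clauses."""

    def out_of_range(value: Optional[float], low: float, high: float) -> bool:
        # As in SQL, a missing value is never "out of range"; it is unknown.
        return value is not None and (value < low or value > high)

    return {
        "invalid_lat": sum(out_of_range(r.get("latitude"), -90, 90) for r in rows),
        "invalid_long": sum(out_of_range(r.get("longitude"), -180, 180) for r in rows),
        "null_coords": sum(
            r.get("latitude") is None or r.get("longitude") is None for r in rows
        ),
        "null_store_id": sum(r.get("store_id") is None for r in rows),
    }


# Hypothetical sample rows, not taken from the Bronze table.
sample_rows = [
    {"store_id": 1, "latitude": 40.7, "longitude": -74.0},
    {"store_id": 2, "latitude": 95.0, "longitude": 10.0},
    {"store_id": 3, "latitude": None, "longitude": 200.0},
    {"store_id": None, "latitude": 12.3, "longitude": 45.6},
]
print(profile_coordinates(sample_rows))
```

Row 3 shows why the extra counter helps: its latitude is missing, so only the longitude check flags it.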

##2. Environment Configuration

We define the table names using Unity Catalog's three-level namespace (catalog.schema.table). The catalog name contains a hyphen, so it must be wrapped in backticks wherever it appears in SQL.

In [0]:
import pandas as pd
from pyspark.sql import functions as F
from pyspark.sql.window import Window
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import StringType

# --- 1. CONFIGURATION ---
# Backticks are required because of the hyphen in the catalog name
CATALOG = "`vstone-catalog`"
SILVER_SCHEMA = "silver_schema"

# Defining table paths for Bronze, Silver, and the Quarantine audit log.
BRONZE_TABLE = f"{CATALOG}.bronze_schema.bronze_stores"
SILVER_TABLE = f"{CATALOG}.{SILVER_SCHEMA}.silver_stores"
QUARANTINE_TABLE = f"{CATALOG}.{SILVER_SCHEMA}.quarantine_stores"
# Create the schema if it doesn't exist to avoid initialization errors.
spark.sql(f"CREATE SCHEMA IF NOT EXISTS {CATALOG}.{SILVER_SCHEMA}")
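
If more catalogs or schemas with special characters are added later, the quoting can be centralized. The sketch below is one possible approach, not the notebook's method: `quote_identifier` and `qualified_name` are hypothetical helpers, and the rule of doubling an embedded backtick is assumed to follow Spark SQL's quoted-identifier convention.

```python
import re


def quote_identifier(name: str) -> str:
    """Backtick-quote a name unless it is a plain [A-Za-z_][A-Za-z0-9_]* identifier."""
    if re.fullmatch(r"[A-Za-z_][A-Za-z0-9_]*", name):
        return name
    # Assumption: an embedded backtick is escaped by doubling it.
    return "`" + name.replace("`", "``") + "`"


def qualified_name(catalog: str, schema: str, table: str) -> str:
    """Build a three-level name, quoting only the parts that need it."""
    return ".".join(quote_identifier(part) for part in (catalog, schema, table))


print(qualified_name("vstone-catalog", "silver_schema", "silver_stores"))
```

With this helper the unquoted catalog name can be stored once and quoted at the point of use.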

##3. Standardization via Pandas UDF
Text standardization is implemented as a Pandas UDF. Spark passes column data to the function in batches as pandas Series, so the string operations run vectorized over each batch instead of row by row.

In [0]:


# --- 3. PANDAS UDF FOR DATA STANDARDIZATION ---
@pandas_udf(StringType())
def standardize_text_udf(text_series: pd.Series) -> pd.Series:
    """Lowercases text and strips leading and trailing whitespace."""
    return text_series.str.lower().str.strip()

# Load Bronze data and convert headers to snake_case for system compatibility.
df_bronze = spark.read.table(BRONZE_TABLE)

# Standardize column headers to snake_case
standardized_cols = [col.lower().replace(" ", "_") for col in df_bronze.columns]
df_standardized = df_bronze.toDF(*standardized_cols)
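
To see what these two steps do without a cluster, here is a minimal plain-Python sketch. `standardize_text` is a row-at-a-time stand-in for the UDF body, and `to_snake_case` is a hypothetical stricter alternative to the header one-liner above; the sample headers are invented for illustration.

```python
import re
from typing import Optional


def standardize_text(value: Optional[str]) -> Optional[str]:
    """Row-at-a-time equivalent of the UDF body: lowercase, then strip both ends."""
    return None if value is None else value.lower().strip()


def to_snake_case(header: str) -> str:
    """Hypothetical stricter normalizer: any run of non-alphanumerics becomes one underscore."""
    return re.sub(r"[^0-9a-z]+", "_", header.strip().lower()).strip("_")


headers = ["Store ID", "Postal  Code", "Load-Dt", " State "]

# The notebook's one-liner only replaces single spaces.
print([h.lower().replace(" ", "_") for h in headers])
# The stricter variant also handles hyphens, repeated spaces, and padding.
print([to_snake_case(h) for h in headers])
print(standardize_text("  Main ST "))
```

The one-liner is sufficient when Bronze headers contain only single spaces; the stricter variant is worth considering if headers arrive with hyphens or padding.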

##4. Geo-Spatial Quality Gates & Quarantine
This is the core "logic gate" of the pipeline. Records that fail business rules (missing IDs or incorrect coordinates) are diverted to a Quarantine table rather than being deleted, preserving data for future troubleshooting.

In [0]:
# --- QUALITY GATES & QUARANTINE ---
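# Logic: Latitude must be within [-90, 90] and longitude within [-180, 180].
# coalesce(..., False) makes NULL coordinates fail the gate; without it the mask
# evaluates to NULL and such rows would be dropped by both filters below.
geo_valid = F.coalesce(
    F.col("latitude").between(-90, 90) & F.col("longitude").between(-180, 180),
    F.lit(False),
)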
id_valid = F.col("store_id").isNotNull()

valid_mask = geo_valid & id_valid

# Isolate malformed records
df_quarantine = df_standardized.filter(~valid_mask) \
    .withColumn("quarantine_reason", 
        F.when(~id_valid, "MISSING_STORE_ID")
         .otherwise("INVALID_GEO_COORDINATES")) \
    .withColumn("quarantined_at", F.current_timestamp())

# Filter clean data
df_clean = df_standardized.filter(valid_mask)
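
The routing rule can be checked on a handful of records in plain Python. This is a sketch under one stated assumption: a missing coordinate is treated as invalid, so every record lands in exactly one of the two outputs. `route_record` and `split_records` are hypothetical names used only here.

```python
from typing import Optional


def in_range(value: Optional[float], low: float, high: float) -> bool:
    # Assumption of this sketch: a missing coordinate counts as invalid.
    return value is not None and low <= value <= high


def route_record(record: dict) -> Optional[str]:
    """Return None for a clean record, otherwise the quarantine reason."""
    if record.get("store_id") is None:
        return "MISSING_STORE_ID"
    if not (
        in_range(record.get("latitude"), -90, 90)
        and in_range(record.get("longitude"), -180, 180)
    ):
        return "INVALID_GEO_COORDINATES"
    return None


def split_records(records: list[dict]) -> tuple[list[dict], list[dict]]:
    """Partition records into (clean, quarantined) without losing any."""
    clean, quarantined = [], []
    for record in records:
        reason = route_record(record)
        if reason is None:
            clean.append(record)
        else:
            quarantined.append({**record, "quarantine_reason": reason})
    return clean, quarantined


records = [
    {"store_id": 1, "latitude": 40.7, "longitude": -74.0},
    {"store_id": 2, "latitude": 123.0, "longitude": 10.0},
    {"store_id": 3, "latitude": None, "longitude": 10.0},
    {"store_id": None, "latitude": 999.0, "longitude": 10.0},
]
clean, quarantined = split_records(records)
print(len(clean), [q["quarantine_reason"] for q in quarantined])
```

As in the cell above, a record with both a missing ID and bad coordinates is labeled with the ID reason, because that check runs first.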


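##5. Deduplication & Normalization
When a store appears more than once, only the record with the most recent load_dt is kept. The surviving rows are then normalized: state is upper-cased, load_dt is cast to a timestamp, and postal_code is cast to bigint.

In [0]: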
# --- DEDUPLICATION & NORMALIZATION ---
# Cast load_dt first so the window orders by real timestamps rather than raw strings,
# then partition by store_id and keep the latest record per store.
df_typed = df_clean.withColumn("load_dt", F.to_timestamp(F.col("load_dt")))

window_spec = Window.partitionBy("store_id").orderBy(F.col("load_dt").desc())

df_silver_final = df_typed.withColumn("row_rank", F.row_number().over(window_spec)) \
    .filter("row_rank == 1") \
    .drop("row_rank") \
    .withColumn("state", F.upper(F.col("state"))) \
    .withColumn("postal_code", F.col("postal_code").cast("bigint"))
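
The "latest record per store" rule is easy to verify on a small sample. The sketch below is plain Python, not the Spark implementation; `latest_per_store` is a hypothetical helper, and the `%m/%d/%Y` date format is an assumption chosen to show why ordering should use parsed dates rather than raw strings.

```python
from datetime import datetime

DATE_FORMAT = "%m/%d/%Y"  # assumed format for this illustration only


def latest_per_store(records: list[dict]) -> list[dict]:
    """Keep, for each store_id, the record with the greatest parsed load_dt."""
    latest: dict = {}
    for record in records:
        loaded_at = datetime.strptime(record["load_dt"], DATE_FORMAT)
        current = latest.get(record["store_id"])
        if current is None or loaded_at > current[0]:
            latest[record["store_id"]] = (loaded_at, record)
    return [record for _, record in latest.values()]


records = [
    {"store_id": 1, "load_dt": "9/1/2024", "city": "old"},
    {"store_id": 1, "load_dt": "10/1/2024", "city": "new"},
    {"store_id": 2, "load_dt": "3/15/2024", "city": "only"},
]

# Compared as raw strings, the September row would win for store 1.
print("9/1/2024" > "10/1/2024")
print({r["store_id"]: r["city"] for r in latest_per_store(records)})
```

If load_dt already arrives as a timestamp or in ISO 8601 form, string and chronological order agree and the distinction does not matter.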


##6. Atomic Delta Writes & Table Constraints
We commit the data to Delta Lake and then apply table constraints: a NOT NULL constraint on store_id and CHECK constraints on latitude and longitude. Delta enforces these on every subsequent write, so a write containing violating rows fails instead of landing in the table.

In [0]:

# --- ATOMIC WRITES ---
# Append failed records to Quarantine; overwrite Silver with the fresh clean set.
# Each write is its own atomic Delta commit; the two writes are not one transaction.
df_quarantine.write.format("delta").mode("append").option("mergeSchema", "true").saveAsTable(QUARANTINE_TABLE)

# Write Silver (Overwrite)
df_silver_final.write.format("delta").mode("overwrite").option("overwriteSchema", "true").saveAsTable(SILVER_TABLE)

# --- APPLY DELTA CONSTRAINTS ---
# Note: SILVER_TABLE already carries the backticks needed for the catalog name.
spark.sql(f"ALTER TABLE {SILVER_TABLE} CHANGE COLUMN store_id SET NOT NULL")

# Add each CHECK constraint in its own try block so that one failure
# (for example, a constraint that already exists) does not skip the other.
constraints = {
    "valid_latitude": "latitude BETWEEN -90 AND 90",
    "valid_longitude": "longitude BETWEEN -180 AND 180",
}
for name, check in constraints.items():
    try:
        spark.sql(f"ALTER TABLE {SILVER_TABLE} ADD CONSTRAINT {name} CHECK ({check})")
    except Exception as e:
        print(f"Constraint {name} already exists or could not be applied: {e}")

print(f"Silver table {SILVER_TABLE} updated successfully.")
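
To illustrate what the constraints buy us, the following plain-Python sketch models an all-or-nothing append that rejects a whole batch when any row breaks a rule. It is a simplified analogy, not Delta Lake's implementation: `CHECKS` and `append_with_checks` are hypothetical, and NULL coordinates are not modeled.

```python
from typing import Callable

# Hypothetical in-memory stand-ins for the table's CHECK constraints.
CHECKS: dict[str, Callable[[dict], bool]] = {
    "valid_latitude": lambda row: -90 <= row["latitude"] <= 90,
    "valid_longitude": lambda row: -180 <= row["longitude"] <= 180,
}


def append_with_checks(table: list[dict], batch: list[dict]) -> None:
    """Append the batch only if every row satisfies every rule."""
    for row in batch:
        if row.get("store_id") is None:
            raise ValueError("NOT NULL constraint violated: store_id")
        for name, check in CHECKS.items():
            if not check(row):
                raise ValueError(
                    f"CHECK constraint {name} violated by store {row['store_id']}"
                )
    table.extend(batch)  # reached only when every row passed


silver: list[dict] = []
append_with_checks(silver, [{"store_id": 1, "latitude": 10.0, "longitude": 20.0}])

bad_batch = [
    {"store_id": 2, "latitude": 11.0, "longitude": 21.0},
    {"store_id": 3, "latitude": 95.0, "longitude": 21.0},
]
try:
    append_with_checks(silver, bad_batch)
except ValueError as error:
    print(f"Batch rejected: {error}")

print(len(silver))
```

The valid row for store 2 is rejected together with the invalid row for store 3, which is the behavior that makes constraints a last line of defense behind the quality gates.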

In [0]:
%sql
-- See who changed the table and when
DESCRIBE HISTORY `vstone-catalog`.silver_schema.silver_stores;



In [0]:
# %sql
# -- Restore the table to a previous state if an error occurred
# RESTORE TABLE `vstone-catalog`.silver_schema.silver_stores TO VERSION AS OF 2;

##Industry Logic & Standards
**1. Geo-Spatial Integrity**
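In retail analytics, mapping store locations is vital. Range checks on latitude (-90 to 90) and longitude (-180 to 180) keep impossible coordinates caused by data entry errors out of the Gold layer, where they would break map visualizations. Range checks do not catch coordinates that are in range but wrong, such as a placeholder (0, 0), which falls in the Gulf of Guinea.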

**2. The Power of Delta History**
The DESCRIBE HISTORY command used in this notebook is a key feature of Delta Lake. It allows data engineers to see a full audit trail of every change (who, when, what operation). If a bad update occurs, the RESTORE TABLE ... TO VERSION AS OF command allows for instant "Undo" functionality.

**3. Idempotency and Overwrites**
Writing Silver with .mode("overwrite") and .option("overwriteSchema", "true") makes the Silver output idempotent: rerunning the notebook against the same Bronze data produces the same clean "State of Truth" without duplicating rows. The Quarantine table is written with .mode("append"), so each rerun adds the same rejected records again.
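
The difference between the two write modes on a rerun can be shown with in-memory lists. This is a toy model, not the Delta writer: `run_pipeline` is a hypothetical function in which list assignment stands in for overwrite and `extend` stands in for append.

```python
def run_pipeline(source: list[dict], silver: list[dict], quarantine: list[dict]) -> None:
    """Toy model of one notebook run: overwrite silver, append to quarantine."""
    clean = [row for row in source if row["store_id"] is not None]
    rejected = [row for row in source if row["store_id"] is None]
    silver[:] = clean            # overwrite: previous contents are replaced
    quarantine.extend(rejected)  # append: previous contents are kept


source = [
    {"store_id": 1, "city": "a"},
    {"store_id": 2, "city": "b"},
    {"store_id": None, "city": "c"},
]
silver: list[dict] = []
quarantine: list[dict] = []

run_pipeline(source, silver, quarantine)
run_pipeline(source, silver, quarantine)  # rerun with the same input

print(len(silver), len(quarantine))
```

After two runs the overwritten target still holds two rows, while the appended target holds the single rejected row twice.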

**4. Data Type Enforcement**
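Casting load_dt to timestamp prepares the data for time-series analysis in the next stage of the Medallion pipeline. Casting postal_code to bigint enforces a numeric type, but it drops leading zeros (02134 becomes 2134) and cannot represent alphanumeric codes, so a string type is the safer choice if such codes can occur.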
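
The effect of an integer cast on postal codes can be previewed in plain Python. `cast_postal_code` is a hypothetical approximation of a lenient cast that yields a null for non-numeric input; Spark's actual behavior depends on configuration, and with ANSI mode enabled an invalid cast raises an error instead. The sample codes are illustrative.

```python
from typing import Optional


def cast_postal_code(raw: str) -> Optional[int]:
    """Approximate a lenient string-to-integer cast: non-numeric input becomes None."""
    text = raw.strip()
    return int(text) if text.isascii() and text.isdigit() else None


samples = ["90210", "02134", "SW1A 1AA", "K1A 0B1"]
for code in samples:
    print(repr(code), "->", cast_postal_code(code))
```

The second sample loses its leading zero and the last two become null, which is worth checking against the Bronze data before relying on the bigint column.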