# 🥈 Phase 2: The Silver Layer (Cleaning & Transformation)
**Project:** "Olist-Next" Hyper-Personalized Retention Engine
**Layer:** Silver (Enterprise/Validated Data)

## 🎯 Objectives
We transform the "Bronze" (Raw) data into "Silver" (Clean) tables by applying specific business rules:
1.  **Orders:** Filter for `order_status = 'delivered'` to focus on completed transactions, and cast the timestamp columns from strings to timestamps.
2.  **Reviews:** Fill null values in the text fields (`"No review text"` for the message, `"No Title"` for the title).
3.  **Products:** Join with `category_translation` to provide English category names.
4.  **Passthrough:** Copy the remaining tables (Order Items, Customers, Payments, Sellers, Geolocation) to Silver unchanged so the Gold layer can read them.

### Setup (Python)
Set the catalog context.

In [0]:
from pyspark.sql.functions import col, to_timestamp, lit

# Set the Catalog Context
spark.sql("USE CATALOG olist_hackathon")

print("✅ Context set to 'olist_hackathon'. Ready for Silver transformations.")
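
The cells below write to `silver.*` tables, so the `silver` schema has to exist in the catalog before they run. If an earlier phase has not created it, something like the following could be run once. This is a minimal sketch: `ensure_silver_schema` is a hypothetical helper, and it assumes the caller is allowed to create schemas in `olist_hackathon`.

```python
def ensure_silver_schema(spark, catalog="olist_hackathon", schema="silver"):
    """Create the target schema if it is missing (no-op when it already exists)."""
    # Hypothetical helper, not part of the original notebook.
    statement = f"CREATE SCHEMA IF NOT EXISTS {catalog}.{schema}"
    spark.sql(statement)
    return statement
```

In the notebook this would be called as `ensure_silver_schema(spark)` right after setting the catalog.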

### Transformation 1 - Orders (Python)
Logic: Filter for 'delivered' and fix timestamp types.

In [0]:
def process_silver_orders():
    print("⏳ Processing Silver Orders...")
    
    # 1. Read Bronze
    df_orders = spark.table("bronze.orders")
    
    # 2. Apply Transformations
    df_cleaned = (df_orders
                  # Filter: Only completed orders are relevant for Churn/CLV 
                  .filter(col("order_status") == "delivered")
                  # Fix Type: Convert string timestamp to Spark TimestampType 
                  .withColumn("order_purchase_timestamp", 
                              to_timestamp(col("order_purchase_timestamp")))
                  .withColumn("order_delivered_customer_date", 
                              to_timestamp(col("order_delivered_customer_date")))
                  .withColumn("order_estimated_delivery_date", 
                              to_timestamp(col("order_estimated_delivery_date")))
                 )
    
    # 3. Write to Silver (Overwrite mode ensures idempotency)
    df_cleaned.write.format("delta").mode("overwrite").saveAsTable("silver.orders")
    print(f"✅ silver.orders created. Count: {df_cleaned.count()}")

process_silver_orders()
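
When ANSI mode is off, `to_timestamp` returns null for strings it cannot parse, so a malformed value disappears silently instead of failing the job. One way to check for that is to count values that are present in Bronze but do not survive the cast. The sketch below only builds the SQL text; `build_cast_check_sql` is a hypothetical helper, and it assumes a runtime where `try_cast` is available.

```python
TIMESTAMP_COLUMNS = (
    "order_purchase_timestamp",
    "order_delivered_customer_date",
    "order_estimated_delivery_date",
)

def build_cast_check_sql(table="bronze.orders", columns=TIMESTAMP_COLUMNS):
    """Build a query counting values that are non-null but fail the timestamp cast."""
    # One SUM(...) per column: 1 for each row where the raw value exists
    # but try_cast cannot turn it into a timestamp.
    checks = ",\n  ".join(
        f"SUM(CASE WHEN {c} IS NOT NULL AND try_cast({c} AS TIMESTAMP) IS NULL "
        f"THEN 1 ELSE 0 END) AS {c}_unparseable"
        for c in columns
    )
    return f"SELECT\n  {checks}\nFROM {table}\nWHERE order_status = 'delivered'"
```

Running `display(spark.sql(build_cast_check_sql()))` in the notebook should show zeros in every column if all delivered orders parsed cleanly.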

### Transformation 2 - Reviews (Python)
Logic: Handle nulls in the text columns (comment title and message).

In [0]:
def process_silver_reviews():
    print("⏳ Processing Silver Reviews...")
    
    df_reviews = spark.table("bronze.reviews")
    
    # Logic: Replace null comments with placeholder text.
    # fillna() with a dict only touches the listed columns (message and title).
    df_cleaned = df_reviews.fillna({"review_comment_message": "No review text", 
                                    "review_comment_title": "No Title"})
    
    df_cleaned.write.format("delta").mode("overwrite").saveAsTable("silver.reviews")
    print("✅ silver.reviews created.")

process_silver_reviews()
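
One detail worth keeping in mind: `fillna` replaces nulls only. An empty string is not null, so it passes through unchanged, and columns not named in the dict are left alone. The plain-Python model below mimics the dict form of `fillna` on two invented rows to show this. It illustrates the semantics and is not Spark code; `fill_nulls` and the sample rows are made up for the example.

```python
DEFAULTS = {
    "review_comment_message": "No review text",
    "review_comment_title": "No Title",
}

def fill_nulls(rows, defaults):
    """Return copies of rows with None replaced only in the columns listed in defaults."""
    return [
        {key: defaults[key] if value is None and key in defaults else value
         for key, value in row.items()}
        for row in rows
    ]

# Invented sample rows: r1 has null text fields, r2 has an empty-string title.
sample = [
    {"review_id": "r1", "review_comment_title": None, "review_comment_message": None},
    {"review_id": "r2", "review_comment_title": "", "review_comment_message": "Great"},
]
cleaned = fill_nulls(sample, DEFAULTS)
```

Here `r1` receives both placeholders, while the empty title on `r2` stays empty. If empty strings should also count as missing, they would need to be converted to null before the fill.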

### Transformation 3 - Products & Enrichment (Python)
Logic: Join Products with Category Translations to get English names.

In [0]:
def process_silver_products():
    print("⏳ Processing Silver Products (Enrichment)...")
    
    # Read both tables
    df_products = spark.table("bronze.products")
    df_trans = spark.table("bronze.category_translation").drop("ingestion_ts").drop("source_file")
    
    # Logic: Join to get English names 
    # We use a Left Join to ensure we don't lose products if a translation is missing
    df_joined = (df_products
                 .join(df_trans, "product_category_name", "left")
                 .drop("product_category_name") # Drop original Portuguese column
                 .withColumnRenamed("product_category_name_english", "category_name") # Rename to clean standard
                )
    
    # Drop _rescued_data column if it exists
    if "_rescued_data" in df_joined.columns:
        df_joined = df_joined.drop("_rescued_data")
    
    df_joined.write.format("delta").mode("overwrite").saveAsTable("silver.products")
    print("✅ silver.products created with English names.")

process_silver_products()
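
Because this is a left join and the Portuguese column is dropped afterwards, a product whose category has no translation ends up with a null `category_name` and no trace of its original label. A quick way to see whether that happens is to list the Bronze categories that have no match. A sketch, assuming the Bronze tables use the column names shown above; `UNTRANSLATED_SQL` and `untranslated_categories` are hypothetical names.

```python
# Categories present in bronze.products with no row in the translation table.
UNTRANSLATED_SQL = """
SELECT p.product_category_name, COUNT(*) AS product_count
FROM bronze.products AS p
LEFT ANTI JOIN bronze.category_translation AS t
  ON p.product_category_name = t.product_category_name
WHERE p.product_category_name IS NOT NULL
GROUP BY p.product_category_name
ORDER BY product_count DESC
"""

def untranslated_categories(spark):
    """Return a DataFrame of category names that would get a null English name."""
    return spark.sql(UNTRANSLATED_SQL)
```

An empty result from `display(untranslated_categories(spark))` would mean every non-null category has an English name. If rows do come back, keeping the Portuguese name as a fallback is one option to consider.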

### The "Passthrough" Tables (Python)
The Gold layer (churn calculation) needs these tables, so we copy them to Silver even though they require no transformation logic.

In [0]:
def process_passthrough_tables():
    # List of tables to move from Bronze to Silver as-is
    tables = ["order_items", "customers", "payments", "sellers", "geolocation"]
    
    for table in tables:
        # Robust Fix: Use the Fully Qualified Name (Catalog.Schema.Table)
        source_path = f"olist_hackathon.bronze.{table}"
        target_path = f"olist_hackathon.silver.{table}"
        
        print(f"⏳ Passthrough processing: {source_path} -> {target_path}...")
        
        try:
            # Read from specific source
            df = spark.table(source_path)
            
            # Write to specific target
            df.write.format("delta").mode("overwrite").saveAsTable(target_path)
            print(f"✅ Created: {target_path}")
            
        except Exception as e:
            print(f"⚠️ Error processing {table}: {e}")
            print("💡 Tip: Run 'SHOW TABLES IN olist_hackathon.bronze' to check if this table exists.")

process_passthrough_tables()
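
The loop above prints errors and moves on, which means a missing table can go unnoticed until the Gold layer fails. One alternative is to collect the failures and raise once at the end, so the notebook run itself is marked as failed. A minimal sketch with a hypothetical `copy_tables_strict` helper; `spark` is passed in explicitly.

```python
def copy_tables_strict(spark, tables, catalog="olist_hackathon"):
    """Copy each Bronze table to Silver; raise after the loop if any copy failed."""
    failures = {}
    for table in tables:
        source = f"{catalog}.bronze.{table}"
        target = f"{catalog}.silver.{table}"
        try:
            (spark.table(source)
                  .write.format("delta")
                  .mode("overwrite")
                  .saveAsTable(target))
        except Exception as exc:  # keep going so every failure gets reported
            failures[table] = str(exc)
    if failures:
        details = "; ".join(f"{name}: {msg}" for name, msg in failures.items())
        raise RuntimeError(f"Passthrough failed for {len(failures)} table(s): {details}")
    # All copies succeeded: return the fully qualified Silver table names.
    return [f"{catalog}.silver.{t}" for t in tables]
```

It could be called as `copy_tables_strict(spark, ["order_items", "customers", "payments", "sellers", "geolocation"])`.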

### Verification (Python)
Proof of success: Check the English category names in the products table.

In [0]:
# Verify the Join
print("--- Silver Products Sample (English Categories) ---")
display(spark.table("olist_hackathon.silver.products").select("product_id", "category_name").limit(5))

# Verify the Orders Filter
print("--- Silver Orders Status Check (Should only be 'delivered') ---")
display(spark.sql("SELECT DISTINCT order_status FROM olist_hackathon.silver.orders"))
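
`display` relies on someone reading the output. For a check that fails the run by itself, the same query can feed an assertion. A sketch, with `assert_only_delivered` as a hypothetical helper:

```python
def assert_only_delivered(spark, table="olist_hackathon.silver.orders"):
    """Raise if the table contains any order_status other than 'delivered'."""
    rows = spark.sql(f"SELECT DISTINCT order_status FROM {table}").collect()
    statuses = {row["order_status"] for row in rows}
    if statuses != {"delivered"}:
        # key=str keeps sorting safe if a null status is present
        found = sorted(statuses, key=str)
        raise AssertionError(f"Unexpected order_status values in {table}: {found}")
    return statuses
```

Note that an empty table also fails this check, since the set of statuses would be empty rather than `{"delivered"}`.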