# Silver Layer: Data Cleaning & Transformation

## Purpose
Read raw transaction data from the Bronze Delta table, apply cleaning and enrichment
transformations, and write to the Silver Delta table.

## Data Flow
```
Bronze Delta Table (Raw) -> Filter -> Enrich -> Deduplicate -> Select -> Silver Delta Table (Clean)
```

## Outputs
- **Table:** `fraud_lakehouse_workspace.default.silver_transactions`
- **Format:** Delta Lake
- **New Columns:** transaction_date, transaction_hour, is_high_value, is_fraud, amount_category, silver_processed_time

## 1. Cleanup (Optional)

Run this cell to **reset the Silver table and checkpoint** before a fresh start.
It stops every active stream in the Spark session, not only the Silver one.
Skip this cell if you want to keep existing data.

In [0]:
# 1. Stop active streams
for s in spark.streams.active:
    s.stop()
print("All active streams stopped.")

# 2. Drop the Silver table
try:
    spark.sql("DROP TABLE IF EXISTS fraud_lakehouse_workspace.default.silver_transactions")
    print("Silver table dropped.")
except Exception as e:
    print(f"Failed to drop Silver table: {e}")

# 3. Clean the checkpoint directory
silver_checkpoint = "YOUR_CHECKPOINT_PATH_HERE" # Replace with your actual checkpoint path, e.g., "dbfs:/fraud_lakehouse_workspace/checkpoints/silver_transactions"
try:
    dbutils.fs.rm(silver_checkpoint, recurse=True)
    print("Checkpoint cleared.")
except Exception as e:
    print(f"Failed to clear checkpoint: {e}")

print("Cleanup finished. Check the messages above for any failed step before restarting.")
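
The loop in this cell stops every active stream in the session. Where other pipelines share the same session, a narrower variant can stop only the Silver query. The sketch below assumes the Silver stream was started with a query name (for example via `.queryName("silver_transactions")`, which the pipeline cell in this notebook does not set). `stop_streams_by_name` is a hypothetical helper; it relies only on the `name`, `id` and `stop()` members of a streaming query.

```python
def stop_streams_by_name(streams, query_name):
    """Stop only the streaming queries whose name equals query_name.

    `streams` is any iterable of objects exposing `name`, `id` and `stop()`,
    such as `spark.streams.active`. Returns the ids of the stopped queries.
    """
    stopped_ids = []
    for stream in streams:
        # Queries started without a query name have name None and are skipped.
        if stream.name == query_name:
            stream.stop()
            stopped_ids.append(stream.id)
    return stopped_ids
```

In the notebook this could be called as `stop_streams_by_name(spark.streams.active, "silver_transactions")`, leaving unrelated streams running.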

## 2. Silver Layer - Read, Transform & Write

This cell performs the full Silver pipeline:
- **Filter:** Remove rows with null Amount or Class
- **Enrich:** Add transaction_date, transaction_hour, is_high_value, is_fraud, amount_category
- **Deduplicate:** Remove duplicates based on eventhub_sequence and Time (no watermark is set, so the deduplication state grows for the lifetime of the stream)
- **Select:** Pick only the columns needed downstream
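
To make the enrichment thresholds concrete, here is a minimal plain-Python sketch of the same rules applied to a single record. It is an illustration, not the Spark implementation: `enrich_record` is a hypothetical helper, `Time` is assumed to hold Unix epoch seconds (which is how `from_unixtime` interprets it), and UTC is assumed where Spark would use the session time zone.

```python
from datetime import datetime, timezone


def enrich_record(record):
    """Apply the Silver enrichment rules to one transaction dict.

    Assumes `Time` holds Unix epoch seconds and uses UTC; Spark's
    from_unixtime uses the session time zone instead.
    """
    amount = record["Amount"]
    ts = datetime.fromtimestamp(record["Time"], tz=timezone.utc)

    # Same bucket boundaries as the amount_category expression in the pipeline.
    if amount < 10:
        category = "Small"
    elif amount < 100:
        category = "Medium"
    elif amount < 1000:
        category = "Large"
    else:
        category = "Very Large"

    return {
        **record,
        "transaction_date": ts.date().isoformat(),
        "transaction_hour": ts.hour,
        "is_high_value": 1 if amount > 1000 else 0,  # strictly greater than 1000
        "is_fraud": 1 if record["Class"] == 1 else 0,
        "amount_category": category,
    }


example = enrich_record({"Time": 3600, "Amount": 1000.0, "Class": 1})
print(example["transaction_hour"], example["is_high_value"], example["amount_category"])
# prints: 1 0 Very Large
```

The example shows a boundary worth knowing: an amount of exactly 1000 falls into "Very Large", yet is_high_value stays 0, because that flag uses a strict `> 1000` comparison.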

In [0]:
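# Import only the functions used below; a wildcard import would shadow Python built-ins such as sum, min and max
from pyspark.sql.functions import (
    col, current_timestamp, from_unixtime, hour, to_date, when
)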

# Define checkpoint
silver_checkpoint = "YOUR_CHECKPOINT_PATH_HERE" # Replace with your actual checkpoint path, e.g., "dbfs:/fraud_lakehouse_workspace/checkpoints/silver_transactions"

print("Building Silver Layer...")

try:
    # 1. Read from Bronze (streaming)
    df_bronze_stream = spark.readStream \
        .option("ignoreDeletes", "true") \
        .table("fraud_lakehouse_workspace.default.bronze_transactions")

    # 2. Silver Transformations
    df_silver = df_bronze_stream \
        .filter(col("Amount").isNotNull()) \
        .filter(col("Class").isNotNull()) \
        .withColumn("transaction_date", 
                    to_date(from_unixtime(col("Time")))) \
        .withColumn("transaction_hour", 
                    hour(from_unixtime(col("Time")))) \
        .withColumn("is_high_value", 
                    when(col("Amount") > 1000, 1).otherwise(0)) \
        .withColumn("is_fraud", 
                    when(col("Class") == 1, 1).otherwise(0)) \
        .withColumn("amount_category",
                    when(col("Amount") < 10, "Small")
                    .when((col("Amount") >= 10) & (col("Amount") < 100), "Medium")
                    .when((col("Amount") >= 100) & (col("Amount") < 1000), "Large")
                    .otherwise("Very Large")) \
        .withColumn("silver_processed_time", current_timestamp()) \
        .dropDuplicates(["eventhub_sequence", "Time"]) \
        .select(
            # Transaction details
            col("Time"),
            col("transaction_date"),
            col("transaction_hour"),
            col("Amount"),
            col("Class"),
            col("is_fraud"),
            col("is_high_value"),
            col("amount_category"),
            # PCA features (V1-V28)
            *[col(f"V{i}") for i in range(1, 29)],
            # Metadata
            col("eventhub_enqueued_time"),
            col("eventhub_offset"),
            col("eventhub_sequence"),
            col("bronze_ingestion_time"),
            col("silver_processed_time")
        )

    # 3. Write to Silver Delta Table
    silver_query = df_silver.writeStream \
        .format("delta") \
        .outputMode("append") \
        .option("checkpointLocation", silver_checkpoint) \
        .option("mergeSchema", "true") \
        .toTable("fraud_lakehouse_workspace.default.silver_transactions")

    print("Silver layer streaming!")
    print(f"Stream ID: {silver_query.id}")
except Exception as e:
    print(f"Silver layer setup failed: {e}")
    raise
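
Once the stream has started, its progress can be checked without querying the table. The sketch below formats a progress report into one line. It assumes `silver_query.lastProgress` returns a plain dict with the keys `batchId`, `numInputRows` and `processedRowsPerSecond` (as in PySpark 3.x) or `None` before the first micro-batch; `summarize_progress` is a hypothetical helper, not part of the pipeline.

```python
def summarize_progress(progress):
    """Return a one-line summary of a streaming progress dict.

    `progress` is expected to look like `silver_query.lastProgress`, which is
    None until the first micro-batch has completed.
    """
    if progress is None:
        return "No micro-batch completed yet."
    batch_id = progress.get("batchId")
    rows = progress.get("numInputRows", 0)
    rate = progress.get("processedRowsPerSecond", 0.0)
    return f"batch {batch_id}: {rows} input rows, {rate:.1f} rows/s processed"
```

In the notebook, `print(summarize_progress(silver_query.lastProgress))` could be run in a separate cell while the stream is active.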

## 3. Alternative: Simplified Write (Optional)

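A simplified version of the Silver write with minimal transformations: it filters only
null Amount values, adds is_high_value, deduplicates on eventhub_sequence alone, and keeps every Bronze column.
Use it for a quick test or if the full transformation above runs into issues.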

> **Note:** Only run this if the full pipeline above has not been started,
> or after running the cleanup cell first.

In [0]:
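from pyspark.sql.functions import col, when, current_timestamp

# Define the checkpoint here as well, so this cell does not depend on an earlier cell having run
silver_checkpoint = "YOUR_CHECKPOINT_PATH_HERE" # Replace with your actual checkpoint path
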
try:
    # 1. Read Stream from Bronze
    df_bronze_stream = spark.readStream \
        .option("ignoreDeletes", "true") \
        .table("fraud_lakehouse_workspace.default.bronze_transactions")

    # 2. Transformation
    df_silver = df_bronze_stream \
        .filter(col("Amount").isNotNull()) \
        .withColumn("is_high_value", when(col("Amount") > 1000, 1).otherwise(0)) \
        .withColumn("silver_processed_time", current_timestamp()) \
        .dropDuplicates(["eventhub_sequence"])

    # 3. Write Stream (This will auto-create the table)
    silver_query = df_silver.writeStream \
        .format("delta") \
        .outputMode("append") \
        .option("checkpointLocation", silver_checkpoint) \
        .toTable("fraud_lakehouse_workspace.default.silver_transactions")

    print("Simplified Silver table created and streaming!")
except Exception as e:
    print(f"Simplified Silver write failed: {e}")
    raise

## 4. Verify: Bronze vs Silver Record Counts

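Compare the total record counts in Bronze and Silver to confirm that
data is flowing through the pipeline. Silver can hold fewer rows than Bronze,
because the full pipeline drops rows with null values and removes duplicates.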

In [0]:
try:
    bronze_count = spark.table("fraud_lakehouse_workspace.default.bronze_transactions").count()
    silver_count = spark.table("fraud_lakehouse_workspace.default.silver_transactions").count()

    print(f"Bronze: {bronze_count}")
    print(f"Silver: {silver_count}")

    if bronze_count == silver_count:
        print("Counts match - every Bronze row reached Silver.")
    else:
        print("Counts differ - expected if Silver dropped null or duplicate rows, or is still catching up with Bronze.")
except Exception as e:
    print(f"Verification failed: {e}")
    raise
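
A count mismatch does not necessarily signal a problem: the full pipeline drops rows with a null Amount or Class and removes duplicates on eventhub_sequence and Time, so Silver can legitimately hold fewer rows than Bronze. The toy model below reproduces those two steps in plain Python on a handful of made-up rows. `expected_silver_count` and the sample data are hypothetical and only illustrate the arithmetic.

```python
def expected_silver_count(rows):
    """Count the rows that survive the Silver filter and deduplication.

    `rows` is a list of dicts with Amount, Class, eventhub_sequence and Time.
    """
    seen_keys = set()
    for row in rows:
        # Filter step: rows with a null Amount or Class are dropped.
        if row["Amount"] is None or row["Class"] is None:
            continue
        # Deduplicate step: only one row per (eventhub_sequence, Time) is kept.
        seen_keys.add((row["eventhub_sequence"], row["Time"]))
    return len(seen_keys)


sample_bronze = [
    {"eventhub_sequence": 1, "Time": 0, "Amount": 5.0, "Class": 0},
    {"eventhub_sequence": 1, "Time": 0, "Amount": 5.0, "Class": 0},   # duplicate
    {"eventhub_sequence": 2, "Time": 1, "Amount": None, "Class": 0},  # null Amount
    {"eventhub_sequence": 3, "Time": 2, "Amount": 20.0, "Class": 1},
]
print(len(sample_bronze), expected_silver_count(sample_bronze))  # prints: 4 2
```

Here four Bronze rows yield two Silver rows, even though nothing went wrong in the pipeline.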

## Summary

**Silver Layer Status:**
- Read from Bronze Delta table (streaming)
- Filtered null Amount and Class values
- Enriched with transaction_date, transaction_hour, is_high_value, is_fraud, amount_category
- Deduplicated on eventhub_sequence and Time
- Written to Silver Delta table with schema evolution enabled

**Next Step:** Run `04_gold_layer` to build analytics and aggregation tables.