# Module 2: PySpark Transformations - Banking Data Cleansing
**Scenario:** Working for a FinTech / Banking Client (e.g., Chase, Wells Fargo, PayPal).

**Objective:** Clean dirty transactional data. Real-world data is rarely clean: expect negative amounts, null IDs, and inconsistently formatted strings.

**The "Silver Layer" Concept:**
In a Modern Data Lakehouse (Databricks/Delta Lake), we have 3 layers:
1.  **Bronze (Raw):** The data exactly as it arrived from the source (CSV/JSON).
2.  **Silver (Cleaned):** Data with types fixed, nulls handled, and duplicates removed. **<- WE ARE HERE**
3.  **Gold (Aggregated):** Business-level reports (e.g., Monthly Sales).

---
## 1. Setup Environment

In [None]:
# Setup PySpark
try:
    import pyspark
    print("PySpark is already installed")
except ImportError:
    print("Installing PySpark...")
    !pip install pyspark findspark

from pyspark.sql import SparkSession
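# Import only the functions this notebook uses instead of a wildcard import.
# Note: `abs` here intentionally shadows Python's built-in abs().
from pyspark.sql.functions import abs, coalesce, col, to_date, trim, upper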

spark = SparkSession.builder \
    .appName("Banking_Data_Cleanser") \
    .master("local[*]") \
    .getOrCreate()

print("Spark Session Ready")

## 2. Load "Dirty" Raw Data
We will generate a small "dirty" dataset containing common banking data issues:
*   **Duplicate** transactions.
*   **Negative** amounts (e.g., -500.0 instead of 500.0).
*   **Null** Customer IDs and **null** amounts.
*   **Inconsistent** Currency codes ('USD', 'usd', ' Usd ').
*   **Mixed** date formats ('2023-01-01' vs. '2023/01/05').

In [None]:
# --- Create Dirty Transaction Data ---
txn_data = [
    ("TXN001", "CUST_A", "2023-01-01", 1000.0, "USD"),
    ("TXN002", "CUST_B", "2023-01-02", -500.0, "usd"),     # Issue: Negative Amount, Lowercase Currency
    ("TXN003", None, "2023-01-03", 250.0, "EUR"),          # Issue: Missing Customer
    ("TXN004", "CUST_C", "2023-01-03", 100.0, " Usd "),    # Issue: Spaces in Currency
    ("TXN001", "CUST_A", "2023-01-01", 1000.0, "USD"),     # Issue: Exact Duplicate of TXN001
    ("TXN005", "CUST_D", "2023/01/05", 1200.0, "USD"),     # Issue: Bad Date Format
    ("TXN006", "CUST_E", "2023-01-06", None, "USD")        # Issue: Null Amount
]

schema = ["txn_id", "customer_id", "txn_date", "amount", "currency"]
df_raw = spark.createDataFrame(txn_data, schema=schema)

print("--- Raw / Dirty Data ---")
df_raw.show()

print("\n--- Summary of Raw Data issues ---")
# describe() is great for seeing mean/max/min (helps catch negatives)
df_raw.describe(["amount"]).show()

## 3. Cleaning Task 1: Handling Duplicates & Nulls
**Rules:**
1.  Remove duplicate transactions, treating `txn_id` as the unique key.
2.  If `customer_id` is missing, we cannot trace the transaction (Audit Risk). **Drop** these rows.
3.  If `amount` is missing (Null), replace it with **0.0** (Default Value).

In [None]:
# 1. Drop duplicates, using txn_id as the unique key.
# Note: dropDuplicates() with no arguments would remove only rows identical in every column.
df_no_dupes = df_raw.dropDuplicates(["txn_id"])

# 2. Drop rows where customer_id is NULL
# subset parameter tells Spark which column to check for nulls
df_valid_cust = df_no_dupes.dropna(subset=["customer_id"])

# 3. Fill Null Amounts with 0.0
# A numeric fill value only applies to numeric columns; subset limits it to `amount`
df_filled = df_valid_cust.fillna(0.0, subset=["amount"])

print("--- After Removing Duplicates & Handling Nulls ---")
df_filled.show()

## 4. Cleaning Task 2: Data Standardization (Fixing Values)
**The Problem:**
*   `currency` is messy ("usd", "USD", " Usd ").
*   Some `amount` values are negative (-500.0). This table should store amounts as absolute values.
*   `txn_date` mixes formats ("2023-01-01" and "2023/01/05").

**Tools:**
*   `trim()`: Removes leading/trailing spaces.
*   `upper()`: Converts to uppercase.
*   `when().otherwise()`: Like IF-ELSE in SQL/Excel.
*   `abs()`: Absolute value.

In [None]:
# 1. Clean Currency: Trim spaces -> Convert to Upper Case
df_std_currency = df_filled.withColumn("currency_clean", upper(trim(col("currency")))) \
                           .drop("currency") # Drop the old messy column

# 2. Fix Negative Amounts: abs() returns the magnitude, so -500.0 becomes 500.0
# (same result as when(amount < 0, amount * -1).otherwise(amount))
df_std_amount = df_std_currency.withColumn("amount_clean", abs(col("amount"))) \
                               .drop("amount")

# 3. Standardize Date
# Spark 3.0+ switched to a stricter date parser (java.time patterns, Proleptic Gregorian calendar).
# Depending on the Spark version and parser policy, a string that the old and new parsers
# would handle differently may raise an error instead of returning null.
# Setting the policy to LEGACY restores the more lenient pre-3.0 parsing behavior.

spark.conf.set("spark.sql.legacy.timeParserPolicy", "LEGACY")

# The raw data mixes 'yyyy-MM-dd' with 'yyyy/MM/dd' (e.g. '2023/01/05').
# coalesce() returns the first non-null result, so the patterns below are tried in order.

df_clean_final = df_std_amount.withColumn("txn_date_clean",
    coalesce(
        to_date(col("txn_date"), "yyyy-MM-dd"),
        to_date(col("txn_date"), "yyyy/MM/dd"),
        to_date(col("txn_date"), "MM/dd/yyyy") # Adding another potential format just in case
    )
).drop("txn_date")

print("--- Final Cleaned Data (Standardized) ---")
df_clean_final.show()

# Notice: every remaining currency is "USD" (the EUR row was dropped for its null customer_id),
# there are no negative amounts, and dates are uniform YYYY-MM-DD.

## 5. Write to Parquet (The Industry Standard)
Why Parquet?
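1.  **Compression:** Columnar encoding plus compression usually makes Parquet several times smaller than the same data as CSV. A 1TB CSV might shrink to something on the order of 100GB, though the exact ratio depends on the data. That means large cost savings on Cloud Storage (S3/Azure Blob).
2.  **Speed:** It is columnar. If you select just `amount`, Spark reads only that column and skips the rest. With CSV, every row must be read in full.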
3.  **Schema Preservation:** Remembers that `amount` is a Double, not a String.

**Action:** Write the clean DataFrame to a folder named `clean_transactions`.

In [None]:
output_path = "clean_transactions"

# Mode 'overwrite' replaces existing data. 'append' adds to it.
df_clean_final.write.mode("overwrite").parquet(output_path)

print(f"Data successfully written to {output_path}")

# --- Verification (Read it back) ---
df_read_back = spark.read.parquet(output_path)
print("\n--- Verified Data from Parquet ---")
df_read_back.show()