# Medallion Architecture Data Cleaning Pipeline 

Delta Live Tables offer a fault-tolerant, optimized approach for building reliable data pipelines, making them ideal for this use case.

In real-world end-to-end (E2E) data projects, roles and responsibilities are typically split as follows:
- **Data Engineers**: Focus on building pipelines that handle common data issues such as duplicates, formatting of columns, schema definition, and invalid values.

- **Data Scientists**: Work on EDA, imputing missing values, handling outliers, and preparing data for modeling (feature engineering, feature selection, dimensionality reduction, etc.).

In this notebook, I will implement a simplified **Medallion Architecture** using **Delta Live Tables (DLT) in Azure Databricks** to simulate real-world data engineering practices.

I will be using the following visualisation as a guide to build the data pipeline. 


<br>

<img src="https://media.datacamp.com/cms/ad_4nxe4oejrhu9gexxri3ea6vmsu1fgxcxbvlwmbaj4ji5s2u31dg3hbyyg4sxmd7ma8-9zamnbxadzz_h4kllvjylicug3v4-iinvx65erdijn4htymmqvc3mjqblskqzdu5ttmodyua.png">



By the end of this notebook, I should be able to: 
- Output a **thoroughly cleansed target dataset** ready for data scientists to conduct EDA, dataset preprocessing, and other model-building practices.

- Define **feature and target variables** from the target table clearly 

## Import Libraries

In [0]:
from pyspark.sql.functions import (
    col, when, count, desc, isnan, isnull, lit, length, trim, lower, upper, to_date, concat_ws,  regexp_extract
)

from pyspark.sql.types import (
    StructType, StructField, StringType, DoubleType, IntegerType, DateType, NumericType
)



## Bronze Delta Table

This serves as a landing zone for raw data and acts as the single source of truth. If data processing in a subsequent stage goes wrong, data specialists can fall back on the **Bronze Delta Table** for reference, which preserves data integrity.



In [0]:


# `import dlt` is only needed inside a DLT pipeline. I cannot create one on
# Azure for Students, so the DLT version below is kept commented out for reference.
# import dlt

# @dlt.table(name="bronze_raw_lendingclub_data", comment="Ingest raw loan data from Lending Club csv")
# def bronze_raw_loans():
#     return spark.read.csv("/FileStore/tables/accepted_2007_to_2018Q4.csv", 
#                           header=True, 
#                           inferSchema=True)
    
# inferSchema=True lets Spark auto-detect each column's dtype, which reduces the casting work later

# ✅ Fallback that does not require a DLT pipeline: read the CSV directly
bronze_df = (
    spark.read
    .option("header", True)
    .option("inferSchema", True)
    .csv("/FileStore/tables/accepted_2007_to_2018Q4.csv")
)

spark.sql("CREATE SCHEMA IF NOT EXISTS bronze")

# ✅ Save the raw data as a Delta table in the `bronze` schema
bronze_df.write.format("delta").mode("overwrite").saveAsTable("bronze.lendingclub_raw")


## Silver Delta Table

Next, the pipeline to produce a Silver Delta Table will mainly perform key data cleaning steps.
  - Deal with Duplicates
  - Remove String Column Spaces
  - Handle String Formatting / Spelling Issues 
  - Ensure UTF-8 for String Columns 
  - Schema Definition 
  - Invalid Value Handling 

### String Columns Cleaning

The specific cleaning steps below follow the findings in **sandbox/string_issues** to maintain data integrity.

In [0]:
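def drop_duplicates(df):
    """Drop fully duplicated rows and report how many were removed."""
    deduped_df = df.dropDuplicates()
    duplicate_rows = df.count() - deduped_df.count()
    print(f"Duplicate rows removed: {duplicate_rows}")

    return deduped_df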

def handle_string_cols_spaces(df): 
    string_cols = [
        field.name for field in df.schema.fields
        if isinstance(field.dataType, StringType)]
    
    # Replaces each existing column with new <string> values which are trimmed 
    for col_name in string_cols:
        df = df.withColumn(col_name, trim(col(col_name)))
    
    return df 

def handle_string_cols_formatting(df):
    """
    Filters out rows with invalid categorical values and drops unused string columns.

    The string issues handled here were identified in ../sandbox/string_issues.ipynb
    """

    print(f"Original Number of Rows: {df.count()}. ")
    
    # # 1. Drops unusable String columns 
    # unusable_cols = ["emp_title",
    # "hardship_type",
    # "verification_status_joint",
    # "hardship_status",
    # "deferral_term",
    # "hardship_length",
    # "hardship_loan_status",
    # "settlement_status", "annual_inc_joint", 'dti_joint']
    # df = df.drop(*unusable_cols) # allows dropping of multiple columns

    # 2. Fix addr_state: keep only rows whose state code is exactly 2 characters long
    df = df.filter(length(col('addr_state')) == 2)

    # 3. Fix invalid string column values 

    # invalid_entries_list = [ "application_type",
    #     "policy_code",
    #     "home_ownership",
    #     "verification_status",
    #     "loan_status",
    #     "pymnt_plan",
    #     "initial_list_status",
    #     "hardship_flag"]
    
    df = df.filter(
        col("application_type").isin("Individual", "Joint App") &
        col("policy_code").isin("1.0", "0.0") &  # stored as string initially
        col("home_ownership").isin("MORTGAGE", "RENT", "OWN", "ANY", "OTHER") &
        col("verification_status").isin("Source Verified", "Not Verified", "Verified") &
        col("loan_status").isin(
            "Fully Paid", "Current", "Charged Off",
            "Late (31-120 days)", "In Grace Period",
            "Late (16-30 days)", "Default"
        ) &
        col("pymnt_plan").isin("n", "y") &
        col("initial_list_status").isin("f", "w") &
        col("hardship_flag").isin("N", "Y")
    )

    df = df.withColumn("policy_code", when(col("policy_code") == "1.0", 1).when(col("policy_code") == "0.0", 0))

    # 4. Drop meaningless string columns 
    meaningless_columns = [
    "url", "desc", "title", "zip_code", "purpose"]
    df = df.drop(*meaningless_columns)

    return df

def cast_string_to_numeric_cols(df): 
    numeric_columns = [
    "id", "member_id", "annual_inc", "annual_inc_joint",
    "dti", "dti_joint", "delinq_2yrs", "fico_range_low", "fico_range_high",
    "inq_last_6mths", "mths_since_last_delinq", "mths_since_last_record",
    "open_acc", "pub_rec", "revol_bal", "revol_util", "total_acc", 
    "out_prncp", "out_prncp_inv", "total_pymnt", "total_pymnt_inv", 
    "total_rec_prncp", "total_rec_int", "total_rec_late_fee", "recoveries", 
    "collection_recovery_fee", "last_fico_range_high", "last_fico_range_low", 
    "collections_12_mths_ex_med", "mths_since_last_major_derog",
    "acc_now_delinq", "tot_coll_amt", "loan_amnt", "funded_amnt", 
    "funded_amnt_inv", "installment", "tot_cur_bal", "total_rev_hi_lim", 
    "inq_fi", "hardship_amount", "hardship_dpd", "orig_projected_additional_accrued_interest",
    "hardship_payoff_balance_amount", "hardship_last_payment_amount", 
    "settlement_amount", "settlement_percentage", "settlement_term", 
    "avg_cur_bal", "total_bal_il", "bc_util", "il_util", "total_cu_tl",
    "max_bal_bc", "percent_bc_gt_75", "total_bal_ex_mort", "all_util", 
    "open_acc_6m", "open_act_il", "open_il_12m", "last_pymnt_amnt", "open_il_24m", "mths_since_rcnt_il", 
    "open_rv_12m", "open_rv_24m", 'emp_length', 'term']

    # Deal with Type Casting to Numeric Data 
    int_cols = ['id', 'member_id']

    for column in numeric_columns: 
        if column in int_cols: 
            df = df.withColumn(column, col(column).cast(IntegerType()))
        elif column == 'emp_length':
            # Convert emp_length to integer values
            df = df.withColumn(
                "emp_length",
                when(col("emp_length").rlike("10\\+"), 10)
                .when(col("emp_length").rlike("< 1"), 0)
                .otherwise(
                    regexp_extract(col("emp_length"), r"(\d+)", 1).cast("int") # extracts 1st regex group digit from string 
                )
            )
        elif column == 'term': 
            df = df.withColumn("term",regexp_extract(col("term"), r"(\d+)", 1).cast("int"))

        else: 
            df = df.withColumn(column, col(column).cast(DoubleType()))

    return df 

def cast_string_to_date_cols(df):
    date_columns = [
    "issue_d", "earliest_cr_line", "last_pymnt_d", "last_credit_pull_d", "next_pymnt_d",
    "sec_app_earliest_cr_line", "hardship_start_date", "hardship_end_date", 
    "payment_plan_start_date", "debt_settlement_flag_date", "settlement_date"]

    # Clean and cast each column
    for date_col in date_columns:
        # Format is 'MMM-yyyy' → Add dummy day '01' → Convert to 'yyyy-MM-dd'
        df = df.withColumn(
            date_col,
            to_date(concat_ws("-", col(date_col), lit("01")), "MMM-yyyy-dd")
        )

    return df 
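
The `emp_length`, `term`, and month-year date rules above are easy to get subtly wrong, so a plain-Python mirror of the same rules can serve as a quick spot check on a few sample strings. This is only a sketch: the helper names (`parse_emp_length`, `parse_term`, `parse_month_year`) are hypothetical, it uses `re` and `datetime` rather than Spark, and the sample strings assume formats such as `10+ years`, `< 1 year`, ` 36 months`, and `Dec-2015`.

```python
import re
from datetime import date, datetime
from typing import Optional


def parse_emp_length(value: Optional[str]) -> Optional[int]:
    """Mirror of the emp_length rule: '10+' -> 10, '< 1' -> 0, else first run of digits."""
    if value is None:
        return None
    if "10+" in value:
        return 10
    if "< 1" in value:
        return 0
    match = re.search(r"(\d+)", value)
    return int(match.group(1)) if match else None


def parse_term(value: Optional[str]) -> Optional[int]:
    """Mirror of the term rule: keep the first run of digits, e.g. ' 36 months' -> 36."""
    if value is None:
        return None
    match = re.search(r"(\d+)", value)
    return int(match.group(1)) if match else None


def parse_month_year(value: Optional[str]) -> Optional[date]:
    """Mirror of the date rule: 'MMM-yyyy' is read as the first day of that month."""
    if value is None:
        return None
    try:
        return datetime.strptime(f"{value}-01", "%b-%Y-%d").date()
    except ValueError:
        return None  # unparseable strings become missing values


# Hypothetical sample values for a quick spot check
samples = ["10+ years", "< 1 year", "3 years", "n/a", None]
print([parse_emp_length(s) for s in samples])
print(parse_term(" 36 months"), parse_month_year("Dec-2015"))
```

Values that match none of the rules come back as `None`, which corresponds to the nulls that the Spark casts produce for the same inputs.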


In [0]:
# Step 0: Read from Bronze Table 
bronze_df_copy = spark.read.table("bronze.lendingclub_raw")

# Step 1: Drop duplicate rows
print(f"Original number of rows: {bronze_df_copy.count()}\n")

df_cleaned = drop_duplicates(bronze_df_copy)
print('✅ Duplicates removed...')


# Step 2: Trim spaces in all string columns
df_cleaned = handle_string_cols_spaces(df_cleaned)
print('✅ Trailing / Leading Spaces removed...')

# Step 3: Filter & fix invalid string formatting
df_cleaned = handle_string_cols_formatting(df_cleaned)
print('✅ Invalid String Column Formatting Settled & Meaningless Columns Dropped ...')

# Step 4: Type Casting
df_cleaned = cast_string_to_numeric_cols(df_cleaned)
df_cleaned = cast_string_to_date_cols(df_cleaned)
print('✅ String Columns Correctly Type Casted...\n')


print(f"New number of rows: {df_cleaned.count()}")

# Step 5: Save as Silver Delta Table 1 (Cleaned Strings Version)
spark.sql("CREATE SCHEMA IF NOT EXISTS silver")
df_cleaned.write.format("delta").mode("overwrite").saveAsTable("silver.lendingclub_cleaned_string")


### Numeric Columns Cleaning
The specific cleaning steps below follow the findings in **sandbox/numeric_issues** to maintain data integrity.

In [0]:
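def clear_invalid_numerical_entries(df):
    # Drop rows with negative values, which are invalid for these columns.
    # Note: rows where the value is NULL are dropped as well, because a NULL
    # comparison evaluates to NULL and filter() only keeps rows that are True.
    df = df.filter(~(col('dti') < 0))
    df = df.filter(~(col('total_rec_late_fee') < 0))

    return df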

    

In [0]:
silver_table1 = spark.read.table("silver.lendingclub_cleaned_string")
print(f"Original number of rows: {silver_table1.count()}\n")

silver_table2 = clear_invalid_numerical_entries(silver_table1)
print('✅ Invalid Numerical Values Settled...')

print(f"Final number of rows: {silver_table2.count()}\n")

In [0]:
display(silver_table2.limit(10))

In [0]:
# Save as Silver Delta Table 2 (Cleaned Numeric Version)
silver_table2.write.format("delta").mode("overwrite").saveAsTable("silver.lendingclub_cleaned_numeric")

silver_table2.limit(10).display()

## Gold Delta Table 
Finally, to produce a Gold Delta Table, I will need to sort the dataset in chronological order. 

For credit risk modeling, banks use past loan data to predict future defaults and related metrics. As such, we want our dataset sorted in **chronological order**, so that models are trained on older data and tested on newer data **(out-of-time split)**.

The data should not be split randomly **(out-of-sample split)**, e.g. with `train_test_split` from `sklearn`, since credit risk modeling is a **time-dependent problem** and a random split would let the model train on loans issued after those it is tested on.

Hence, I will be sorting the dataset right from the start. 

Once the Gold Delta Table is produced, the downstream work falls to data scientists: imputing missing values, feature engineering, and dimensionality reduction for accurate credit risk modeling.
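
The difference between an out-of-time split and a random split can be sketched in a few lines. The example below is a minimal plain-Python illustration, not part of the pipeline: the function name `out_of_time_split`, the toy rows, and the cutoff date are all hypothetical, and only the `issue_d` column name comes from the dataset.

```python
from datetime import date
from typing import Dict, List, Tuple

Row = Dict[str, object]


def out_of_time_split(
    rows: List[Row], cutoff: date, date_key: str = "issue_d"
) -> Tuple[List[Row], List[Row]]:
    """Put loans issued before `cutoff` in train and loans issued on or after it in test."""
    # Rows without an issue date cannot be placed on the timeline, so they are left out
    dated = [r for r in rows if r.get(date_key) is not None]
    ordered = sorted(dated, key=lambda r: r[date_key])
    train = [r for r in ordered if r[date_key] < cutoff]
    test = [r for r in ordered if r[date_key] >= cutoff]
    return train, test


# Toy rows in arbitrary order (hypothetical values)
loans = [
    {"id": 3, "issue_d": date(2017, 6, 1)},
    {"id": 1, "issue_d": date(2015, 1, 1)},
    {"id": 4, "issue_d": date(2018, 3, 1)},
    {"id": 2, "issue_d": date(2016, 9, 1)},
]

train, test = out_of_time_split(loans, cutoff=date(2017, 1, 1))
print([r["id"] for r in train])  # [1, 2]
print([r["id"] for r in test])   # [3, 4]
```

Because the split is defined by a date comparison rather than by row position, it does not depend on the order in which the rows happen to be stored.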


In [0]:
silver_table2 = spark.read.table("silver.lendingclub_cleaned_numeric")

gold_df = silver_table2.orderBy(["issue_d"], ascending=True)



In [0]:
# Save as Gold Delta Table (chronologically sorted version)
spark.sql("CREATE SCHEMA IF NOT EXISTS gold")

gold_df.write.format("delta").mode("overwrite").saveAsTable("gold.medallion_cleaned_lc_data")


In [0]:
# Check if Gold Delta is accessible to data scientists
gold_table = spark.read.table('gold.medallion_cleaned_lc_data')
gold_table.limit(10).display()

As seen, the `issue_d` column is not in sorted order when the table is read back, even though I sorted the DataFrame before saving it as a Gold Delta Table. After researching, I realised this is expected: Spark writes and reads a table's data files in parallel, so the row order of a saved table is not guaranteed on read. To counter this, data scientists must sort by `issue_d` (or filter on it) after reading the table and before any machine learning model building.