# String Formatting EDA (Lending Club)
This notebook uses **RapidFuzz** to quickly surface formatting issues in the string columns of the Lending Club Bronze Delta table. The issues found here will then be resolved in the data cleaning pipeline of the Medallion Architecture.

## 1. Import Required Libraries

In [0]:
# Import Libraries 

from pyspark.sql.functions import (
    col, when, count, desc, isnan, isnull, lit, length, trim, lower, upper, to_date, concat_ws
)

from pyspark.sql.types import (
    StructType, StructField, StringType, DoubleType, IntegerType, DateType
)

from rapidfuzz import process

## 2. Read Data 

In [0]:
# To find which catalog I am currently working in 
spark.sql("SELECT current_catalog(), current_database()").show()

spark.sql("SHOW TABLES IN spark_catalog.default").show()



In [0]:
# Read from Bronze Delta Table (Uncleaned & Raw)
df = spark.read.table("bronze.lendingclub_raw")

df.printSchema() 

# Adjust the path to wherever your raw Lending Club Delta table sits
# raw_df = spark.read.format("delta").load("/mnt/bronze/lending_club")


## 3. Identify all String Columns 

In [0]:
all_string_cols = [
    f.name
    for f in df.schema.fields
    if isinstance(f.dataType, StringType)
]
print("String columns:", all_string_cols)


print(f"Number of String Columns: {len(all_string_cols)}")


## 4. Identify Number of Distinct Values Per String Column
The following code prints the number of distinct values in each string column. This helps decide which columns warrant fuzzy grouping with RapidFuzz (more computationally expensive) and which can be cleaned manually (lower effort and cost).



In [0]:
for col_num, col_name in enumerate(all_string_cols, 1):  # numbering starts at 1
    distinct_count = df.select(col_name).distinct().count()
    # :02d zero-pads the index to 2 digits; :6d right-aligns the count in a 6-character field
    print(f"{col_num:02d}. {col_name} -> {distinct_count:6d} distinct values")
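
Once the distinct counts are known, the triage described above can be written down explicitly. The sketch below is a plain-Python illustration only: the counts, the `MANUAL_REVIEW_MAX` cutoff, and the `triage_by_cardinality` helper are all hypothetical and are not taken from the dataset.

```python
# Hypothetical cutoff: columns with at most this many distinct values are
# cheap enough to review by eye; the rest are candidates for fuzzy grouping.
MANUAL_REVIEW_MAX = 50


def triage_by_cardinality(distinct_counts, manual_max=MANUAL_REVIEW_MAX):
    """Split columns into 'manual' and 'fuzzy' buckets by distinct count."""
    buckets = {"manual": [], "fuzzy": []}
    for name, n_distinct in distinct_counts.items():
        key = "manual" if n_distinct <= manual_max else "fuzzy"
        buckets[key].append(name)
    return buckets


# Made-up counts, for illustration only
example_counts = {"home_ownership": 9, "pymnt_plan": 4, "emp_title": 500000}
buckets = triage_by_cardinality(example_counts)
print(buckets)
```

The cutoff is a judgment call; a column with few distinct values can still be worth a fuzzy pass if its top values show near-duplicates.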


Based on the results above, we define lists of columns to skip for fuzzy string matching, because they hold numeric values, dates, or values that are not meaningful to group (such as URLs and free text).

In [0]:
numeric_columns = [
    "id", "member_id", "emp_length", "annual_inc", "annual_inc_joint",
    "dti", "dti_joint", "delinq_2yrs", "fico_range_low", "fico_range_high",
    "inq_last_6mths", "mths_since_last_delinq", "mths_since_last_record",
    "open_acc", "pub_rec", "revol_bal", "revol_util", "total_acc", 
    "out_prncp", "out_prncp_inv", "total_pymnt", "total_pymnt_inv", 
    "total_rec_prncp", "total_rec_int", "total_rec_late_fee", "recoveries", 
    "collection_recovery_fee", "last_fico_range_high", "last_fico_range_low", 
    "collections_12_mths_ex_med", "mths_since_last_major_derog",
    "acc_now_delinq", "tot_coll_amt", "loan_amnt", "funded_amnt", 
    "funded_amnt_inv", "installment", "tot_cur_bal", "total_rev_hi_lim", 
    "inq_fi", "hardship_amount", "hardship_dpd", "orig_projected_additional_accrued_interest",
    "hardship_payoff_balance_amount", "hardship_last_payment_amount", 
    "settlement_amount", "settlement_percentage", "settlement_term", 
    "avg_cur_bal", "total_bal_il", "bc_util", "il_util", "total_cu_tl",
    "max_bal_bc", "percent_bc_gt_75", "total_bal_ex_mort", "all_util", 
    "open_acc_6m", "open_act_il", "open_il_12m", "last_pymnt_amnt", "open_il_24m", "mths_since_rcnt_il", 
    "open_rv_12m", "open_rv_24m", "term"  # "emp_length" is already listed above
]


date_columns = [
    "issue_d", "earliest_cr_line", "last_pymnt_d", "last_credit_pull_d", "next_pymnt_d",
    "sec_app_earliest_cr_line", "hardship_start_date", "hardship_end_date", 
    "payment_plan_start_date", "debt_settlement_flag_date", "settlement_date"
]

meaningless_columns = [
    "url", "desc", "title", "zip_code", "purpose"
]



## 5. Show Top Frequent N Values Per String Column 
Before doing any fuzzy grouping, we look at the most frequent values in each string column. For example, `home_ownership` might contain 'rent', 'Rent' and 'RENT '. Such variants would distort subsequent analysis and skew machine learning models.

This check shows whether a column has obvious variants and guides the decision on which columns to cluster with RapidFuzz, which keeps computational cost down.
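
To make the `home_ownership` example concrete, the following plain-Python sketch counts a handful of made-up values (not read from the table) before and after trimming whitespace and upper-casing. It only illustrates why variants split the counts; it is not part of the pipeline.

```python
from collections import Counter

# Made-up sample of raw values, mimicking the variants described above
raw_values = ["rent", "Rent", "RENT ", "RENT", "OWN", "own", "MORTGAGE"]

raw_counts = Counter(raw_values)
normalised_counts = Counter(v.strip().upper() for v in raw_values)

print(raw_counts.most_common())         # each variant is counted separately
print(normalised_counts.most_common())  # variants collapse into one value
```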


In [0]:
def show_raw_top_values(df, column, top_n=30):
    print(f"\nTop {top_n} raw values for column '{column}':")
    (
        df
        .groupBy(col(column))
        .count()
        .orderBy(desc("count"))
        .limit(top_n)
        .show(n=top_n, truncate=False) # by default, shows 20 entries only
    )


# Combine all columns to skip
skip_cols = set(numeric_columns + date_columns + meaningless_columns)

# Loop through only valid string columns not in skip list
for col_name in all_string_cols:
    if col_name not in skip_cols:
        show_raw_top_values(df, col_name, top_n=30)




From the results above, the faulty entries that require cleaning are grouped below, to streamline data cleaning later on.

**Invalid Entries**
- **application_type**: Drop entries that are not Individual or Joint App.
- **home_ownership**: Drop anything that is not MORTGAGE, RENT, OWN, ANY, or OTHER.
- **verification_status**: Drop anything that is not Source Verified, Not Verified, or Verified.
- **loan_status**: Has legacy system entries.
    - **'Does not meet credit policy'**: Shows that the loan was not approved. Such records are out of scope, since the credit risk modeling project aims to predict PD, LGD, and EAD for approved loans.
    - Given that there are very few such records, they shall be **dropped**.
    - Valid entries: Fully Paid, Current, Charged Off, Late (31-120 days), In Grace Period, Late (16-30 days), Default
- **pymnt_plan**: Drop all entries that are not 'n' or 'y'.
- **initial_list_status**: Drop all entries that are not 'w' or 'f'.
- **policy_code**: Drop all entries that are not 1.0 or 0.0, and convert the remaining values to 1 and 0.
- **hardship_flag**: Drop all entries that are not N or Y.
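
The rules above amount to an allow-list per column. Here is a minimal plain-Python sketch, using the allowed values listed in this section for a subset of the columns. The `ALLOWED_VALUES` dict and `is_valid` helper are hypothetical names, and the rows are made up.

```python
# Allowed values per column, copied from the list above (subset of columns)
ALLOWED_VALUES = {
    "application_type": {"Individual", "Joint App"},
    "home_ownership": {"MORTGAGE", "RENT", "OWN", "ANY", "OTHER"},
    "verification_status": {"Source Verified", "Not Verified", "Verified"},
    "pymnt_plan": {"n", "y"},
    "initial_list_status": {"w", "f"},
    "hardship_flag": {"N", "Y"},
}


def is_valid(row, allowed=ALLOWED_VALUES):
    """Return True if every constrained column holds an allowed value."""
    return all(row.get(name) in values for name, values in allowed.items())


# Made-up rows for illustration
good_row = {
    "application_type": "Individual", "home_ownership": "RENT",
    "verification_status": "Verified", "pymnt_plan": "n",
    "initial_list_status": "w", "hardship_flag": "N",
}
bad_row = dict(good_row, home_ownership="Oct-2015")

print(is_valid(good_row), is_valid(bad_row))
```

In Spark, the same check would typically be expressed per column with `isin`.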


**Rapid Fuzz: String Columns**
- **emp_title**
- **sub_grade**
- **addr_state**

**Unusable Columns**: 
- **hardship_type**: Many null values (2 million+).
- **verification_status_joint**: Entries that are not Verified, Individual, Not Verified, or Source Verified would need to be dropped, but most values are null (about 2 million records).
- **hardship_status**: About 2 million null records.
- **deferral_term**
- **hardship_length**
- **hardship_loan_status**
- **settlement_status**

## 6. RapidFuzz String Similarity Matching Function
Now that we have narrowed down the columns with inconsistent spelling and formats, we want to group similar strings together and map all variants to a single string value.

The function below uses RapidFuzz to score string similarity and groups values whose score meets or exceeds a threshold. RapidFuzz is a faster alternative to FuzzyWuzzy for string matching, which makes it better suited to large datasets.


In [0]:
def get_fuzzy_groups(distinct_values, threshold):
    """
    Given a Python list of distinct strings from a column, return a dict mapping each string
    to a chosen 'canonical' string (the first occurrence). Two strings with similarity ≥ threshold
    collapse into the same canonical string.

    Illustrative output (the actual grouping depends on the scorer, the threshold, the
    input order, and whether values were normalised, e.g. lower-cased, beforehand):
    {
        "Rent": "Rent",
        "RENT": "Rent",
        "rent": "Rent",
        "Own": "Own",
        "own": "Own",
        "mortgag": "mortgag",
        "Mortgage": "mortgag"
    }
    """

    mapping_dict = {}
    canonical_list = []

    for val in distinct_values:
        if not val or val.strip() == "":
            continue

        result = process.extractOne(val, canonical_list, score_cutoff=threshold)
        if result:
            match, score, _ = result
            mapping_dict[val] = match
        else:
            canonical_list.append(val)
            mapping_dict[val] = val

    return mapping_dict
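
Because the first value seen becomes the canonical string, the result depends on input order. The sketch below reproduces the same greedy logic with the standard library's `difflib.SequenceMatcher` as a stand-in scorer. It is not RapidFuzz and its scores will differ. `similarity` and `greedy_groups` are hypothetical names used only for this illustration.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Stand-in scorer on a 0-100 scale (difflib, not RapidFuzz)."""
    return 100 * SequenceMatcher(None, a, b).ratio()


def greedy_groups(values, threshold):
    """Map each value to its best-scoring earlier canonical if it reaches threshold."""
    mapping, canonicals = {}, []
    for val in values:
        best = max(canonicals, key=lambda c: similarity(val, c), default=None)
        if best is not None and similarity(val, best) >= threshold:
            mapping[val] = best
        else:
            # No sufficiently similar canonical yet: this value starts a new group
            canonicals.append(val)
            mapping[val] = val
    return mapping


# Same values, different order: the canonical label changes
print(greedy_groups(["mortgag", "mortgage"], threshold=90))
print(greedy_groups(["mortgage", "mortgag"], threshold=90))
```

Feeding values in descending order of frequency, as the cells below do with `orderBy`, tends to make the most common spelling the canonical one.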

### 6.1 RapidFuzz String Matching Test (1 column)

The following code demonstrates how fuzzy matching works, for learning purposes. 

In [0]:
from collections import defaultdict

# Choose one column you want to test fuzzy grouping on
example_column = "emp_title"

# Keep only the 5,000 most common non-null values.
# Nulls are filtered before the limit so they cannot take up one of the slots.
top_vals = (
    df.where(col(example_column).isNotNull())
    .groupBy(example_column)
    .count()
    .orderBy("count", ascending=False)
    .limit(5000)  # try 1000 or 5000 for speed
    .rdd.flatMap(lambda x: [x[0]])
    .collect()
)

# Run fuzzy grouping on top frequent raw values
threshold = 98
mapping_dict = get_fuzzy_groups(top_vals, threshold=threshold)


fuzzy_groups = defaultdict(list)
for original, canonical in mapping_dict.items():
    fuzzy_groups[canonical].append(original)

print(f"\nFuzzy groups for column: '{example_column}' (threshold = {threshold})")
for canonical, members in fuzzy_groups.items():
    if len(members) > 1:
        print(f"  Canonical: '{canonical}' ← {members}\n")
    else:
        print(f"  (Solo) '{canonical}'")


### 6.2 RapidFuzz Fuzzy Grouping (String Columns) 
We will be checking similar strings in each of the necessary columns as shown below. 

In [0]:

for column in ["emp_title", "sub_grade", "addr_state"]:
    print("Current Col: " + column + "\n")
    example_column = column

    # Keep only the 5,000 most common non-null values.
    # Nulls are filtered before the limit so they cannot take up one of the slots.
    top_vals = (
        df.where(col(example_column).isNotNull())
        .groupBy(example_column)
        .count()
        .orderBy("count", ascending=False)
        .limit(5000)  # try 1000 or 5000 for speed
        .rdd.flatMap(lambda x: [x[0]])
        .collect()
    )
    
    top_vals_cleaned = [val.lower().strip() for val in top_vals if isinstance(val, str)]

    # Run fuzzy grouping on top frequent raw values
    threshold = 80
    mapping_dict = get_fuzzy_groups(top_vals_cleaned, threshold=threshold)


    fuzzy_groups = defaultdict(list)
    for original, canonical in mapping_dict.items():
        fuzzy_groups[canonical].append(original)

    print(f"\nFuzzy groups for column: '{example_column}' (threshold = {threshold})")
    for canonical, members in fuzzy_groups.items():
        if len(members) > 1:
            print(f"  Canonical: '{canonical}' ← {members}\n")
        else:
            print(f"  (Solo) '{canonical}'")


`emp_title` has some string formatting issues. However, because the column has high cardinality (too many distinct values) and its values cannot be grouped into meaningful categories, it serves little purpose in credit risk modeling.

`sub_grade` is fully clean.
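
A likely reason short codes such as sub-grades stay separate at a threshold of 80 is that a single differing character already removes half of the similarity. The sketch below uses `difflib.SequenceMatcher` from the standard library as a rough stand-in for a RapidFuzz scorer. This is an assumption: actual RapidFuzz scores may differ. The sample values are made up.

```python
from difflib import SequenceMatcher


def score(a, b):
    """Similarity on a 0-100 scale using difflib (stand-in, not RapidFuzz)."""
    return 100 * SequenceMatcher(None, a, b).ratio()


threshold = 80
pairs = [("a1", "a2"), ("b4", "c4"), ("teacher", "teachers")]
for left, right in pairs:
    s = score(left, right)
    print(f"{left!r} vs {right!r}: {s:.1f} -> merged: {s >= threshold}")
```

Longer strings tolerate a one-character difference, which is why a threshold that is safe for two-character codes can still merge near-identical job titles.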

As seen, `addr_state` has a few entries that are too long, when each should consist of exactly 2 characters. To assess the severity, we calculate the percentage of entries with more than 2 characters. From the results, it is safe to remove all rows with more than 2 characters, as they will have little impact on subsequent machine learning model building.

In [0]:

# Code to calculate the percentage of entries with more than 2 characters
total_entries = df.count() 
long_entries_addr_state = df.filter(length(df.addr_state) > 2).count()
percentage_long_entries = (long_entries_addr_state / total_entries) * 100
print(f"Percentage of entries with more than 2 characters: {percentage_long_entries:.2f}%")
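
A length check catches over-long values but not two-character values that are not state codes. A stricter check, sketched here in plain Python with made-up values, tests for exactly two uppercase letters. In Spark, the same pattern could be applied with `rlike`. The `STATE_PATTERN` and `invalid_share` names are hypothetical, and unlike the cell above, this sketch uses non-null values as the denominator.

```python
import re

# Exactly two uppercase ASCII letters, e.g. "CA"
STATE_PATTERN = re.compile(r"[A-Z]{2}")


def invalid_share(values):
    """Percentage of non-null values that are not two uppercase letters."""
    present = [v for v in values if v is not None]
    if not present:
        return 0.0
    bad = sum(1 for v in present if not STATE_PATTERN.fullmatch(v))
    return 100 * bad / len(present)


# Made-up sample: one over-long entry and one two-character non-code
sample = ["CA", "NY", "TX", "debt_consolidation", "12", None, "FL", "WA", "IL", "OH"]
print(f"{invalid_share(sample):.1f}% invalid")
```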


## 7. Inspect Numeric Columns (Currently String)
To ensure that casting these columns to numeric types will not compromise data integrity, I will inspect the 30 most frequent values of each column.

In [0]:

for col_name in numeric_columns:
    show_raw_top_values(df, col_name, top_n=30)

From the output, the issues are as follows:

**Unusable Columns**
- **annual_inc_joint** 
- **dti_joint**

**Numeric Columns**
- **emp_length**
- **term**
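
Both columns embed a number in text. Below is a minimal sketch of the extraction, assuming values shaped like `10+ years`, `< 1 year`, and ` 36 months`. These formats are assumptions and should be verified against the output above. `parse_leading_int` and `parse_emp_length` are hypothetical helpers, and treating `< 1 year` as 0 is a modelling choice rather than something the data dictates.

```python
import re


def parse_leading_int(text):
    """Return the first integer found in text, or None if there is none."""
    if text is None:
        return None
    match = re.search(r"\d+", text)
    return int(match.group()) if match else None


def parse_emp_length(text):
    """Map employment length to whole years; '< 1 year' becomes 0 (a choice)."""
    if text is not None and text.strip().startswith("<"):
        return 0
    return parse_leading_int(text)


print([parse_emp_length(v) for v in ["10+ years", "< 1 year", "3 years", "n/a"]])
print([parse_leading_int(v) for v in [" 36 months", " 60 months"]])
```

In Spark, `regexp_extract` followed by a cast would play the same role.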

I kept certain columns despite their high percentage of null values. For example, `mths_since_last_record` is a strong signal of an individual's credit health.

Columns like `collections_12_mths_ex_med` contain some non-numeric string values. However, such values make up a small percentage, so casting these columns to numeric types will have little impact on data integrity.
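
One way to quantify "small percentage" before casting is to measure the share of non-null values that do not parse as numbers. The sketch below does this in plain Python on made-up values. `non_numeric_share` is a hypothetical helper. In Spark with ANSI mode off, a string that cannot be cast becomes null, so the same share could be estimated by comparing null counts before and after the cast.

```python
def non_numeric_share(values):
    """Return (percentage, offending values) for non-null, non-numeric strings."""
    present = [v for v in values if v is not None]
    offenders = []
    for v in present:
        try:
            float(v)  # note: float() also accepts "nan", "inf" and padded strings
        except ValueError:
            offenders.append(v)
    share = 100 * len(offenders) / len(present) if present else 0.0
    return share, offenders


# Made-up sample for illustration
sample = ["0.0", "1.0", "0.0", "Individual", None, "2.0"]
share, offenders = non_numeric_share(sample)
print(f"{share:.1f}% non-numeric: {offenders}")
```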

## 8. Inspect Date Columns (Currently String)

Dates are important in credit risk modeling because of out-of-time splitting, so type casting should be done at the data cleaning stage. As such, let's inspect the most frequent values in the date columns (currently strings).

In [0]:
for col_name in date_columns:
    show_raw_top_values(df, col_name, top_n=30)

As seen, there are no major issues with the current date columns. However, to support time-based analysis, these columns should be cast to a date type. Note that they are in a month-year format (`MM-YYYY`) with no day component, so a day has to be supplied (for example, the first of the month) when casting to a full date, which is what allows an out-of-time split.
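
A month-year string can be turned into a full date by fixing the day to the first of the month, after which an out-of-time split is a simple comparison. The sketch below uses only the standard library. The `%b-%Y` pattern (e.g. `Dec-2015`), the sample values, and the cutoff date are assumptions to be checked against the actual column contents; if months are numeric, `%m-%Y` would be the pattern instead.

```python
from datetime import date, datetime


def parse_month_year(text, fmt="%b-%Y"):
    """Parse a month-year string to the first day of that month, else None."""
    try:
        # %b matches abbreviated month names in the current locale (English assumed)
        return datetime.strptime(text.strip(), fmt).date()
    except (AttributeError, ValueError):
        return None


# Made-up values; the last one is deliberately malformed
issue_dates = [parse_month_year(v) for v in ["Dec-2015", "Jan-2017", "2016"]]

# Hypothetical cutoff for an out-of-time split
cutoff = date(2016, 6, 1)
train = [d for d in issue_dates if d is not None and d < cutoff]
holdout = [d for d in issue_dates if d is not None and d >= cutoff]
print(train, holdout)
```

In Spark, `to_date` with a matching pattern serves the same purpose.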

## 9. Conclusion 

To conclude the findings for string columns, the columns needed for subsequent data cleaning are listed below.

In [0]:
# To be continued in finding issues with numeric data ... 
invalid_entries_list = [
    "application_type",
    "policy_code",
    "home_ownership",
    "verification_status",
    "loan_status",
    "pymnt_plan",
    "initial_list_status",
    "hardship_flag",
    "term",
    "emp_length",
]
    
string_cols_formatting_fix = ['addr_state']


numeric_columns = [
    "id", "member_id", "annual_inc", "annual_inc_joint",
    "dti", "dti_joint", "delinq_2yrs", "fico_range_low", "fico_range_high",
    "inq_last_6mths", "mths_since_last_delinq", "mths_since_last_record",
    "open_acc", "pub_rec", "revol_bal", "revol_util", "total_acc", 
    "out_prncp", "out_prncp_inv", "total_pymnt", "total_pymnt_inv", 
    "total_rec_prncp", "total_rec_int", "total_rec_late_fee", "recoveries", 
    "collection_recovery_fee", "last_fico_range_high", "last_fico_range_low", 
    "collections_12_mths_ex_med", "mths_since_last_major_derog",
    "acc_now_delinq", "tot_coll_amt", "loan_amnt", "funded_amnt", 
    "funded_amnt_inv", "installment", "tot_cur_bal", "total_rev_hi_lim", 
    "inq_fi", "hardship_amount", "hardship_dpd", "orig_projected_additional_accrued_interest",
    "hardship_payoff_balance_amount", "hardship_last_payment_amount", 
    "settlement_amount", "settlement_percentage", "settlement_term", 
    "avg_cur_bal", "total_bal_il", "bc_util", "il_util", "total_cu_tl",
    "max_bal_bc", "percent_bc_gt_75", "total_bal_ex_mort", "all_util", 
    "open_acc_6m", "open_act_il", "open_il_12m", "last_pymnt_amnt", "open_il_24m", "mths_since_rcnt_il", 
    "open_rv_12m", "open_rv_24m"
]


date_columns = [
    "issue_d", "earliest_cr_line", "last_pymnt_d", "last_credit_pull_d", "next_pymnt_d",
    "sec_app_earliest_cr_line", "hardship_start_date", "hardship_end_date", 
    "payment_plan_start_date", "debt_settlement_flag_date", "settlement_date"
]

meaningless_columns = [
    "url", "desc", "title", "zip_code", "purpose"
]

unusable_cols = [
    "emp_title",
    "hardship_type",
    "verification_status_joint",
    "hardship_status",
    "deferral_term",
    "hardship_length",
    "hardship_loan_status",
    "settlement_status",
    "annual_inc_joint",
    "dti_joint",
]



Unusable columns are columns with a large percentage of missing values or, in the case of `emp_title`, too many distinct values to be useful. Missing values will be handled by data scientists later on, since that falls under feature selection and does not belong in the data engineering pipeline.