# String Formatting EDA (Lending Club)
This notebook uses **RapidFuzz** to quickly surface formatting issues in the string columns of the Lending Club Bronze Delta table. The issues found here will then be resolved in the Medallion Architecture data cleaning pipeline.

## 1. Import Required Libraries

In [0]:
from pyspark.sql.types import StringType
from pyspark.sql.functions import col, trim, lower, count, desc
from rapidfuzz import process

## 2. Read Data 

In [0]:
# Read from the Bronze Delta table (raw, uncleaned).
# Adjust the path to wherever your raw Lending Club Delta table sits,
# e.g. "/mnt/bronze/lending_club"
df = spark.read.format("delta").load("bronze_raw_lendingclub_data")
df.printSchema()


## 3. Identify all String Columns 

In [0]:
all_string_cols = [
    f.name
    for f in df.schema.fields
    if isinstance(f.dataType, StringType)
]
print("String columns:", all_string_cols)


print(f"Number of String Columns: {len(all_string_cols)}")


## 4. Identify Number of Distinct Values Per String Column
The following code prints how many distinct values each string column has. This helps decide which columns need fuzzy grouping with RapidFuzz (more computationally expensive) and which have few enough values to clean manually (lower effort and cost).



In [0]:
for col_num, col_name in enumerate(all_string_cols, 1): # starts from col_num = 1 
    distinct_count = df.select(col_name).distinct().count() 
    print(f"{col_num:02d}. {col_name} -> {distinct_count:6d} distinct values" ) 
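
The decision rule described above can be sketched in plain Python once the distinct counts are known. The helper name `triage_columns` and the cutoff of 20 are hypothetical, and the counts are illustrative placeholders rather than values measured from the Bronze table.

```python
def triage_columns(distinct_counts, manual_max=20):
    """Split columns into manual-cleaning and fuzzy-grouping candidates.

    distinct_counts: dict of column name -> number of distinct values.
    manual_max: hypothetical cutoff; columns at or below it are small
                enough to inspect and clean by hand.
    """
    manual, fuzzy = [], []
    # Sort by distinct count so the smallest columns are listed first
    for name, n in sorted(distinct_counts.items(), key=lambda kv: kv[1]):
        (manual if n <= manual_max else fuzzy).append(name)
    return manual, fuzzy


# Illustrative counts only (not measured from the Bronze table)
example_counts = {"home_ownership": 6, "emp_title": 50000, "purpose": 14}
manual_cols, fuzzy_cols = triage_columns(example_counts)
print("Manual cleaning:", manual_cols)
print("Fuzzy grouping:", fuzzy_cols)
```

The cutoff is a judgment call: a column with a handful of values can be fixed with an explicit lookup, while a free-text column with thousands of values is where fuzzy grouping is more likely to pay for its cost.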


## 5. Show Top Frequent N Values Per String Column 
Before doing any fuzzy grouping, we can inspect the most frequent values in each string column. For example, `home_ownership` might contain 'rent', 'Rent' and 'RENT ', which would be counted as three separate categories. Variants like these distort subsequent analysis and can skew machine learning models.

This check shows whether a column has obvious variants, which guides the decision on which columns to cluster with RapidFuzz and keeps computational cost down.


In [0]:
def show_raw_top_values(df, column, top_n=30):
    print(f"\nTop {top_n} raw values for column '{column}':")
    (
        df
        .groupBy(col(column))
        .count()
        .orderBy(desc("count"))
        .limit(top_n)
        .show(truncate=False)
    )
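
To illustrate why raw counts can mislead, the following plain-Python sketch counts the `home_ownership` variants mentioned above before and after trimming and lowercasing. The sample list is invented for illustration; in the notebook itself, the same normalization would be done on the Spark column with the imported `trim` and `lower` functions.

```python
from collections import Counter

# Invented sample values mirroring the variants described above
sample = ["rent", "Rent", "RENT ", "own", "OWN", "rent"]

# Raw counts: every spelling variant is its own category
raw_counts = Counter(sample)

# Normalized counts: variants collapse after strip + lower
clean_counts = Counter(v.strip().lower() for v in sample)

print("Raw:  ", dict(raw_counts))
print("Clean:", dict(clean_counts))
```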



## 6. RapidFuzz String Similarity Matching Function
Now that we have narrowed down the columns with inconsistent spelling and formatting, we want to group similar strings together and map each group of variants to a single string value.

RapidFuzz scores how similar two strings are, and the function below uses those scores to group values whose similarity meets a threshold. RapidFuzz is generally faster than FuzzyWuzzy for string matching, which matters when a column has many distinct values.


In [0]:
def get_fuzzy_groups(distinct_values, threshold=85):
    """
    Given a Python list of distinct strings from a column, return a dict mapping each string
    to a chosen 'canonical' string (first occurrence). Two strings with similarity ≥ threshold
    collapse into the same 'canonical' string.

    Note: the result depends on input order. Recent RapidFuzz versions compare strings as-is
    (case-sensitive) unless a `processor` is passed, so case variants may only collapse
    if values are normalized first.

    Example output (illustrative):
    {
        "Rent": "Rent",
        "RENT": "Rent",
        "rent": "Rent",
        "Own": "Own",
        "own": "Own",
        "mortgag": "mortgag",
        "Mortgage": "mortgag"
    }

    """
    canonical_list = []  # accepted canonical values
    mapping = {}

    for val in distinct_values:

        # Skip None, empty strings and whitespace-only strings
        if not val or val.strip() == "":
            continue

        # extractOne returns None when no choice reaches score_cutoff
        # (including when canonical_list is still empty), so do not unpack directly
        result = process.extractOne(val, canonical_list, score_cutoff=threshold)

        # High similarity: map val to the existing canonical value, e.g. 'Mortgage' -> 'mortgag'
        # Low similarity: register val as a new canonical value
        if result is not None:
            mapping[val] = result[0]
        else:
            canonical_list.append(val)
            mapping[val] = val

    return mapping
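
The greedy "first occurrence wins" grouping can also be tried without any third-party dependency. The sketch below swaps RapidFuzz for the standard library's `difflib.SequenceMatcher`, so the scores differ from RapidFuzz's; `similarity` and `greedy_groups` are hypothetical names used only for this illustration, and the comparison is deliberately made case-insensitive.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Case-insensitive similarity on a 0-100 scale (stdlib stand-in for RapidFuzz)."""
    return 100 * SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()


def greedy_groups(values, threshold=85):
    """Map each value to the most similar earlier canonical value, if close enough."""
    canonicals, mapping = [], {}
    for val in values:
        # Best-scoring canonical so far, or None while the list is empty
        best = max(canonicals, key=lambda c: similarity(val, c), default=None)
        if best is not None and similarity(val, best) >= threshold:
            mapping[val] = best
        else:
            canonicals.append(val)
            mapping[val] = val
    return mapping


values = ["Rent", "RENT", "rent ", "Own", "mortgag", "Mortgage"]
groups = greedy_groups(values)
for original, canonical in groups.items():
    print(f"{original!r} -> {canonical!r}")
```

Because the first value seen becomes the canonical one, a misspelling such as 'mortgag' can end up as the label for its group. Sorting the input (for example by frequency) before grouping is one way to influence which variant wins.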


### 6.1 RapidFuzz String Matching Test (1 column)

The following code demonstrates how fuzzy matching works, for learning purposes. 

In [0]:
from collections import defaultdict

# Choose one column you want to test fuzzy grouping on
example_column = "home_ownership"

# Step 1: All Row Objects in this column 
distinct_vals = df.select(example_column).distinct().collect()

# Step 2: Row objects -> Raw Strings (keeping None values as-is) (Fuzzy-matching only accepts strings)
distinct_vals = [row[example_column] for row in distinct_vals]

# Step 3: Run fuzzy grouping on non-null values only, default_threshold=85% 
threshold = 85
mapping_dict = get_fuzzy_groups([v for v in distinct_vals if v is not None], threshold=threshold)

# Step 4: Ensures 'Rent' : ['Rent', 'rEnt', 'RENT'] in fuzzy_groups 
fuzzy_groups = defaultdict(list) # auto-creates key, if key is not in the dictionary 
for original_val, canonical_val in mapping_dict.items():
    fuzzy_groups[canonical_val].append(original_val)

# Output
print(f"\nFuzzy groups for column: '{example_column}' (threshold = {threshold})")
for canonical, group_members in fuzzy_groups.items():

    # If there are many variants of 'Rent'
    if len(group_members) > 1:
        print(f"  Canonical: '{canonical}' ← {group_members}")

    # Distinct Value (No variants)
    else:
        print(f"  (Solo) '{canonical}'")


### 6.2 RapidFuzz Fuzzy Grouping (String Columns) 
In this section, we **skip fuzzy grouping** for some columns because their content does not match their data type. For example, `term` (the loan term) should be skipped since it holds numeric data, even though it is stored as a string in the Bronze Delta Table.

In [0]:
# Skipping selected columns 
# skip_fuzzy_columns = ["term", "zip_code", "int_rate", "loan_amnt"]


# # For each column (not skipped), display top 10 values and count of records 
# for i, col_name in enumerate([c for c in all_string_cols if c not in skip_fuzzy_columns], 1):
#     print(f"\n{i}. Column: '{col_name}'")

#     # Show top 10 values of each column 
#     df.groupBy(col_name).count().orderBy("count", ascending=False).show(10, truncate=False)

#     # Get distinct values (excluding nulls)
#     values = df.select(col_name).distinct().collect()
#     values = [row[col_name] for row in values if row[col_name] is not None]

#     # Fuzzy match similar values
#     groups = get_fuzzy_groups(values, threshold=90)

#     # Print fuzzy clusters
#     grouped = defaultdict(list)
#     for original, canonical in groups.items():
#         grouped[canonical].append(original)

#     for canon, variants in grouped.items():
#         if len(variants) > 1:
#             print(f"  Canonical: '{canon}' ← {variants}")
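
One possible way to spot columns that hold numeric data stored as strings, and are therefore poor candidates for fuzzy grouping, is to check what share of sampled values parse as numbers. This is a minimal sketch: the helper `numeric_share`, the removal of `%` and `months`, and the sample values are all assumptions made for illustration and are not part of the pipeline.

```python
def numeric_share(values):
    """Return the fraction of non-empty values that parse as a number
    after removing whitespace, a '%' sign and the word 'months'."""
    parsed = total = 0
    for v in values:
        # Ignore None, empty and whitespace-only values
        if v is None or not v.strip():
            continue
        total += 1
        cleaned = v.strip().lower().replace("months", "").rstrip("% ").strip()
        try:
            float(cleaned)
            parsed += 1
        except ValueError:
            pass
    return parsed / total if total else 0.0


# Invented sample values, shaped like typical loan data
samples = {
    "term": [" 36 months", " 60 months", None],
    "int_rate": ["13.5%", "7.9%"],
    "home_ownership": ["RENT", "OWN", "MORTGAGE"],
}
shares = {name: numeric_share(vals) for name, vals in samples.items()}
for name, share in shares.items():
    print(f"{name}: {share:.0%} numeric-like")
```

Columns with a high share could be added to the skip list and cast to a numeric type in the cleaning pipeline instead.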


## 7. Export String Mapping Dictionary 