This notebook aims to simulate a data scientist consuming data from a data pipeline, using PySpark. 

In this notebook, I will be executing the following steps:
  - Conduct EDA on data produced from Medallion Data Pipeline (Missing Values, Identifying Distributions and Relationships etc)
  - Deal with Multicollinearity 
  - Feature Selection & Transformation, Standardisation, Dimensionality Reduction 
  - Dealing with Dataset Imbalance

# 1. Import Libraries

In [0]:
from pyspark.sql.functions import (
    col, when, count, desc, isnan, isnull, lit, length, trim, lower, upper, to_date, concat_ws,  regexp_extract, sum 
)

from pyspark.sql.types import (
    StructType, StructField, StringType, DoubleType, IntegerType, DateType, NumericType
)

# 2. EDA: Summary Statistics & Identify Data Issues

In this section, I will be acting as a data scientist (credit risk modeling) pulling data from the Gold Delta Layer of the Medallion Architecture. I will mainly be observing summary statistics, spotting and solving issues (e.g. missing values), and understanding the distribution of features.


In [0]:
df = spark.read.table('gold.medallion_cleaned_lc_data')
df.limit(10).display()

In [0]:
df.summary().display() 

Based on the summary statistics of the Lending Club dataset, several data quality issues become apparent. 
- Some columns are unusable due to a **large percentage of missing values** (see `../sandbox/string_issues`)

- A number of features exhibit **missing values**, including `emp_length` etc. This requires imputation strategies to be implemented

- Some columns are irrelevant to our credit risk modeling project, due to **high cardinality** (large number of unique values), e.g. `emp_title` **(categorical data)**. 

- There are also **redundant columns**, like `member_id`, which provide no value to our prediction of LGD, EAD and PD.

- There is also **post-loan information**: features whose values are generated after loan origination (application and approval). Such features should be dropped, since they would skew our subsequent machine learning models. Examples include `total_pymnt` and `last_pymnt_d`.

- There are features like `delinq_2yrs` which have **outliers** (maximum far above the 75th percentile)

- There are some **invalid data points**, e.g. `dti` being > 100%. Such data points should be removed manually, since they are not handled well enough by the Medallion Architecture. 


# 3. EDA: Feature Handling 

Based on the above issues identified by me with the Lending Club Dataset, I will now be tackling each of them in order. 

### 3.1 Find Null Value % Per Column 
For this credit risk modeling project, I will be dropping columns with >= 50% missingness. Many credit risk modelling projects on Kaggle and GitHub use 50%-65% missingness as the threshold to drop columns as well. 

However, let's first display the columns which have >= 50% missing values to inspect them.

In [0]:

# Get total number of rows
total_rows = df.count()

# Calculate % of nulls per column and keep only those ≥ 50%
missing_val_threshold = 50

high_missingness_columns = []

for column in df.columns:
    null_count = df.select(sum(col(column).isNull().cast("int"))).collect()[0][0]
    null_pct = (null_count / total_rows) * 100
    if null_pct >= missing_val_threshold:
        print(f"{column}: {null_pct:.2f}% null")
        high_missingness_columns.append(column)

# Drop columns with >= 50% missing values (Low predictive power upon inspection)
df = df.drop(*high_missingness_columns) 
print("\n✅ Columns with high pct of missing values dropped ... \n")

# Inspect Dimensions

num_rows = df.count()

num_cols = len(df.columns)

print(f"Updated Shape: ({num_rows}, {num_cols})")
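
The drop rule can be checked by hand on a toy table. This is a minimal plain-Python sketch, not the Spark code above: the rows, the `desc` column and the helper `null_pct_per_column` are made up for illustration, and the 50% cut-off is taken from the prose of this section. It avoids the name `sum`, since this notebook imports `sum` from `pyspark.sql.functions`.

```python
# Toy rows standing in for a DataFrame; None marks a missing value.
rows = [
    {"loan_amnt": 1000, "desc": None, "emp_length": "2 years"},
    {"loan_amnt": 2500, "desc": None, "emp_length": None},
    {"loan_amnt": 4000, "desc": "car", "emp_length": "10+ years"},
    {"loan_amnt": 1200, "desc": None, "emp_length": "1 year"},
]


def null_pct_per_column(rows):
    """Return {column: percentage of rows where the value is None}."""
    total = len(rows)
    result = {}
    for column in rows[0]:
        null_count = len([r for r in rows if r[column] is None])
        result[column] = null_count / total * 100
    return result


threshold = 50  # hypothetical variable mirroring the 50% rule in the text
null_pct = null_pct_per_column(rows)

# Columns at or above the threshold are the ones that would be dropped.
to_drop = [c for c, pct in null_pct.items() if pct >= threshold]
print(null_pct)
print(to_drop)
```

Because the comparison is `>=`, a column that is exactly 50% null is dropped as well.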

### 3.3 Dropping Irrelevant/Redundant Columns 
This section removes **meaningless columns, high-cardinality categorical features, features with little predictive value (e.g. `member_id`, `emp_title`), and post-loan features**. Including such features may introduce multicollinearity and ultimately lower the predictive power of our credit models. 

Reasons why I removed certain columns are as shown: 
- Columns with `inv`: Largely the same as their counterparts, e.g. `total_pymnt_inv` is largely the same as `total_pymnt`

- `last_pymnt_d` and `last_credit_pull_d` (according to the Data Dictionary) have little predictive value even after feature engineering. They merely show the borrower's last payment date and the last date the credit report was pulled, which has little value in predicting PD, LGD or EAD. 

- `sub_grade` is more granular than `grade`. This may lead to a risk of overfitting of our PD, LGD, and EAD models. 

- High Cardinality Columns may lead to high computational costs in encoding for machine learning models, which makes it undesirable in a big data space such as credit risk. 

- Hardship & Settlement Features: For Lending Club, borrowers only become eligible for hardship and settlement programmes after loan origination, not when they apply for the loan. Borrowers notify lenders of financial hardship and attempt to settle for interest-free payments or lower principal payments. Such features should not be used to predict PD, LGD, and EAD, since my models should not know in advance whether a borrower will fall into hardship. 

- `disbursement_method` indicates how loan funds are delivered to the borrower. This has little relevance in predicting PD, LGD or EAD. 

- 🚩 Low Variance Features may slow down PCA (which aims to reduce dimensionality). They also add little value to the prediction of PD, EAD and LGD. (Dealt with after standardisation)
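
To make the high-cardinality point concrete, here is a small plain-Python sketch of how one-hot encoding width grows with the number of distinct categories. The sample values and the helpers `one_hot_width` and `one_hot` are illustrative assumptions, not part of the pipeline.

```python
# Made-up sample values: free-text job titles vs. a small fixed set of grades.
emp_title = ["Teacher", "Nurse", "teacher", "Driver", "RN", "Manager", "Nurse"]
grade = ["A", "B", "B", "C", "A", "D", "B"]


def one_hot_width(values):
    """Number of columns a naive one-hot encoding would create."""
    return len(set(values))


def one_hot(values):
    """Encode each value as a 0/1 vector over the sorted distinct categories."""
    categories = sorted(set(values))
    return [[int(v == c) for c in categories] for v in values]


print(one_hot_width(emp_title))  # one column per distinct spelling
print(one_hot_width(grade))
print(one_hot(grade)[0])
```

Free-text fields such as job titles add a new column for every spelling variant ("Teacher" vs "teacher"), so the encoded width scales with the data rather than with a fixed set of categories.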


In [0]:
# Drop Derived/Meaningless Features 
derived_features = ["funded_amnt_inv", "sub_grade", "out_prncp_inv", "total_pymnt_inv", "last_pymnt_d", "last_credit_pull_d"] 
df = df.drop(*derived_features)
print(f"✅ Derived/Meaningless Features Dropped ...")


In [0]:
# Drop High Cardinality Features 

# 1. Define Threshold 
high_cardinality_threshold = 50

# 2. Find Categorical Features (to identify high cardinality columns)


categorical_cols = [field.name for field in df.schema.fields if isinstance(field.dataType, StringType)]
print(categorical_cols)

# 3. Identify high-cardinality columns
high_card_cols = []

for col_name in categorical_cols:
    unique_count = df.select(col_name).distinct().count()

    if unique_count >= high_cardinality_threshold:
        print(f"\n{col_name} has {unique_count} unique values → dropping ... ")
        high_card_cols.append(col_name)

# 4. Drop high cardinality columns
df = df.drop(*high_card_cols)

print(f"\n✅ High Cardinality Features Dropped ...")

In [0]:
# Drop hardship related columns & miscellaneous columns 
hardship_columns = ["hardship_flag", "disbursement_method", "debt_settlement_flag", 'policy_code']

df = df.drop(*hardship_columns)
print("✅ Hardship & miscellaneous columns dropped ...")

🚩 I will need to remove post-loan origination features later on for PD prediction. Post-loan origination features such as `recoveries` are needed for LGD, EAD prediction, but not for PD prediction. 

### 3.4 Impute Missing Values (Categorical & Numerical)
After removing unnecessary columns with little predictive power, we will proceed to impute missing values. We will first identify % missing values per column. 

For numerical columns, missing values will be replaced with the median, since the median is robust to the outliers we have not yet dealt with. For categorical columns, missing values will be replaced with the mode category. This approach is common and simple, though more advanced imputation techniques (e.g. clustering-based) exist. To keep the focus on learning about the credit risk modeling domain, we will not pursue them in this project. 
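
As a quick illustration of why the median is preferred before outliers are handled, the plain-Python sketch below compares mean, median and mode fills on toy data. The values are invented, and `annual_inc` and `home_ownership` are used only as illustrative column names.

```python
from statistics import mean, median, mode

# Invented values; None marks a missing entry. One extreme earner is included.
annual_inc = [42_000, 55_000, 61_000, None, 48_000, 2_500_000]
home_ownership = ["RENT", "MORTGAGE", None, "RENT", "OWN", "RENT"]

observed_inc = [v for v in annual_inc if v is not None]
observed_home = [v for v in home_ownership if v is not None]

fill_mean = mean(observed_inc)      # pulled far upwards by the single outlier
fill_median = median(observed_inc)  # stays near the typical borrower
fill_mode = mode(observed_home)     # most frequent non-missing category

imputed_inc = [fill_median if v is None else v for v in annual_inc]
imputed_home = [fill_mode if v is None else v for v in home_ownership]

print(fill_mean, fill_median, fill_mode)
```

The mean fill lands far above every ordinary income in the list, while the median fill is a value a typical borrower could plausibly have. Note that the mode is computed over non-missing values only.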

In [0]:
total_rows = df.count()

for column in df.columns: 
    null_count = df.filter(col(column).isNull()).count()
    if null_count > 0:
        print(f"{column}: {null_count} null values, {round(null_count/total_rows * 100,2)}% missing values.")

In [0]:
# Impute each column according to its data type
for feature in df.schema.fields:
    col_name = feature.name
    dtype = feature.dataType

    # Impute Categorical Columns with Mode (most frequent non-null category)
    if isinstance(dtype, StringType):
        mode_row = (
            df.filter(col(col_name).isNotNull())  # exclude nulls so null is never picked as the mode
            .groupBy(col_name)
            .count()
            .orderBy(col("count").desc())
            .first()
        )

        # mode_row is None only if the column has no non-null values at all
        if mode_row is not None:
            df = df.fillna({col_name: mode_row[0]})

    # Impute Numerical Columns with Median
    elif isinstance(dtype, (IntegerType, DoubleType)):
        if df.filter(col(col_name).isNull()).count() > 0:
            median_val = df.approxQuantile(col_name, [0.5], 0.01)[0]
            df = df.fillna({col_name: median_val})

print('✅ Categorical Column Missing Values Filled!')
print('✅ Numerical Column Missing Values Filled!')




In [0]:
# Double check if there are any missing values before subsequent steps
total_rows = df.count()

output_arr = []
for column in df.columns: 
    null_count = df.filter(col(column).isNull()).count()
    if null_count > 0:
        output_arr.append(f"{column}: {null_count} null values, {round(null_count/total_rows * 100,2)}% missing values.")

if len(output_arr) == 0: 
    print('✅ No Missing Values Found!')
else:
    print(output_arr)

In [0]:
df_filled_missing = df

### 3.5 Handling Outliers 

It is important to handle outliers since they create skewed distributions, which can distort credit risk models.

They can dominate learning, causing bias or overfitting. 

But in credit risk, it is important to note that some outliers (e.g., bankrupted borrowers) are important signals. Such outliers should not be blindly trimmed. 

We will first use the `approxQuantile()` method, which is a relatively computationally efficient way to identify outliers in big data. It is used to compute the % of outliers per numerical column. 

In [0]:
def percent_outliers(df, col_name, lower_pct=0.25, upper_pct=0.75):
    # 1. Compute percentile bounds
    quantiles = df.approxQuantile(col_name, [lower_pct, upper_pct], 0.01)
    q1, q3 = quantiles[0], quantiles[1]
    iqr = q3 - q1 

    # 2. Obtain lower and upper bound, any data points outside of this are seen as outliers 
    lower_bound = q1 - 1.5 * iqr
    upper_bound = q3 + 1.5 * iqr
    total_rows = df.count()

    return df.filter( (col(col_name) < lower_bound) | (col(col_name) > upper_bound) ).count() / total_rows * 100

# 1. After cleaning missing values, find % outliers per column 
sample_df = df_filled_missing.sample(fraction=0.05, seed=42) # Efficiency 


outliers_dict = {}

for feature in sample_df.schema: 
    col_name = feature.name 
    data_type = feature.dataType

    if isinstance(data_type, (DoubleType, IntegerType)): 
        outlier_pct = percent_outliers(sample_df, col_name) 
        if outlier_pct > 0: 
            outliers_dict[col_name] = outlier_pct
            print(f"{col_name}: {round(outlier_pct,2) }% ")
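
The IQR rule inside `percent_outliers` can be worked through by hand on a toy list. This is a plain-Python sketch with invented values, and `iqr_fences` and `percent_outliers_py` are hypothetical helper names. It uses exact quantiles from the standard library, whereas Spark's `approxQuantile` is approximate, so the fences on real data may differ slightly.

```python
from statistics import quantiles


def iqr_fences(values, k=1.5):
    """Return (lower, upper) fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def percent_outliers_py(values, k=1.5):
    """Percentage of values falling outside the IQR fences."""
    lower, upper = iqr_fences(values, k)
    flagged = [v for v in values if v < lower or v > upper]
    return len(flagged) / len(values) * 100


# Invented, zero-heavy counts (similar in shape to a delinquency counter).
delinq_2yrs = [0, 0, 0, 0, 0, 0, 0, 1, 1, 9]

print(iqr_fences(delinq_2yrs))
print(percent_outliers_py(delinq_2yrs))
```

For zero-heavy columns the IQR is very small, so the fences sit close to zero and even modest values get flagged. This is one reason a high outlier percentage often signals skewness rather than bad data.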


Now we can define concrete rules for dealing with outliers: 
- **Very few outliers (up to 1%)**: Trim / Drop rows 
- **Few Outliers (1 - 5%) & Outliers are Valid but Extreme Entries**: Winsorise (Cap values)

- **High % Outliers (above 5%)**: Likely to be heavily right-skewed (to be transformed later on)

The goal of dropping and winsorizing is to keep the % of outliers slightly below 5%. Columns that still contain many outliers will be transformed later on, so that these records do not skew our credit risk models. 
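
The difference between trimming and winsorising can be seen on a toy list. This is a plain-Python sketch: the helper names are hypothetical, the fences are chosen by hand, and the 1% and 5% cut-offs mirror the ones used in the code cell that applies these rules.

```python
def outlier_action(pct):
    """Map an outlier percentage to the handling rule."""
    if pct <= 0:
        return "none"
    if pct <= 1:
        return "trim"
    if pct <= 5:
        return "winsorise"
    return "transform later"


def trim(values, lower, upper):
    """Drop every value outside [lower, upper]."""
    return [v for v in values if lower <= v <= upper]


def winsorise(values, lower, upper):
    """Cap values at the bounds instead of dropping them."""
    return [min(max(v, lower), upper) for v in values]


values = [3, 5, 6, 7, 8, 40]
lower, upper = 0, 12  # hand-picked fences for the example

print(outlier_action(0.4), outlier_action(3.2), outlier_action(12.0))
print(trim(values, lower, upper))
print(winsorise(values, lower, upper))
```

Trimming removes the whole row, so it loses the borrower's other features too. Winsorising keeps the row and only pulls the extreme value back to the fence.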

In [0]:
# 2. Save df to df1 before dealing with outliers 
df1 = df_filled_missing

In [0]:

# 3. Perform trimming & winsorisation 
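def winsorize_column(df, col_name, lower_pct=0.25, upper_pct=0.75, k=1.5):
    # Get quartiles, then derive the same IQR fences used in percent_outliers
    q1, q3 = df.approxQuantile(col_name, [lower_pct, upper_pct], 0.01)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr

    # Apply winsorization: cap at the fences rather than at the quartiles themselves,
    # since capping at Q1/Q3 would flatten roughly half of the column
    return df.withColumn(
        col_name,  
        when(col(col_name) < lower, lower)
        .when(col(col_name) > upper, upper)
        .otherwise(col(col_name))
    )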

# Debug: ensure dictionary is not empty
print(f"🧮 Total columns with outliers: {len(outliers_dict)}\n")

# Iterate over each column in outliers_dict
for col_name in outliers_dict.keys(): 
    pct = outliers_dict[col_name]

    if 0 < pct <= 1:
        print(f"✅ Dropping rows with outliers in {col_name} ({round(pct, 2)}%) ...")
        q1, q3 = df1.approxQuantile(col_name, [0.25, 0.75], 0.01)
        iqr = q3 - q1
        lower = q1 - 1.5 * iqr
        upper = q3 + 1.5 * iqr
        df1 = df1.filter((col(col_name) >= lower) & (col(col_name) <= upper))

    elif 1 < pct <= 5:
        print(f"✅ Winsorizing {col_name} ({round(pct, 2)}%) ...")
        df1 = winsorize_column(df1, col_name)

    else:
        print(f"🚩 Skipping {col_name} ({round(pct, 2)}%) — too many outliers ...")



In [0]:
# 4. Check for outlier % again 
sample_df1 = df1.sample(fraction=0.05, seed=42) # Efficiency 

# Use the 5% sample (not the full df1) so this check stays cheap
for feature in sample_df1.schema: 
    col_name = feature.name 
    data_type = feature.dataType

    if isinstance(data_type, (DoubleType, IntegerType)): 
        outlier_pct = percent_outliers(sample_df1, col_name) 
        if outlier_pct > 0: 
            print(f"{col_name}: {round(outlier_pct,2) }% ")

In [0]:
# 5. Check distribution for all numerical columns 
import matplotlib.pyplot as plt
import pandas as pd
from pyspark.sql.types import NumericType

# 1. Select numerical columns
numeric_cols = [field.name for field in df1.schema if isinstance(field.dataType, NumericType)]

# 2. Sample small portion of data (e.g., 5%) and convert to pandas
sample_df2 = df1.select(numeric_cols).sample(fraction=0.05, seed=42)
sample_pdf = sample_df2.toPandas()

# 3. Plot histograms as subplots
n_cols = 3  # Number of plots per row
n_rows = (len(numeric_cols) + n_cols - 1) // n_cols

fig, axes = plt.subplots(n_rows, n_cols, figsize=(15, 4 * n_rows))
axes = axes.flatten()

for i, col_name in enumerate(numeric_cols):
    axes[i].hist(sample_pdf[col_name].dropna(), bins=50, color='skyblue')
    axes[i].set_title(col_name, fontsize=10)
    axes[i].tick_params(axis='x', rotation=45)

# Hide any unused subplots
for j in range(i + 1, len(axes)):
    axes[j].axis('off')

plt.tight_layout()
plt.show()


### 3.6 Perform Feature Transformation (Skewness)
We will group the numerical features into the following categories: 
- **Weird Bimodal Distributions**: Avoid over-transforming them since this reflects real-world extreme borrowers' behaviour / demographics
- **Highly Right Skewed (Continuous)**: Perform log transformation (if skewness issue is unsolved, utilise Box-Cox)
- **Highly Right Skewed (Pct / Probability)**: Logit Transformation
- **Left Skewed Variables**: Use Power / Exponential Transformations
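
As a rough illustration of two of these transformations, the plain-Python sketch below applies a log transform to a right-skewed toy list and a logit transform to a proportion. The data, the helper names and the clipping constant `eps` are assumptions made for the example. Box-Cox and power transforms are not shown.

```python
import math


def log_transform(x):
    """log(1 + x), which is defined at x = 0 (useful for balances and counts)."""
    return math.log1p(x)


def logit_transform(p, eps=1e-6):
    """log(p / (1 - p)) for a proportion p, clipped away from exactly 0 and 1."""
    p = min(max(p, eps), 1 - eps)
    return math.log(p / (1 - p))


def skewness(values):
    """Population skewness: third central moment divided by variance ** 1.5."""
    n = len(values)
    mu = math.fsum(values) / n
    m2 = math.fsum((v - mu) ** 2 for v in values) / n
    m3 = math.fsum((v - mu) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Invented right-skewed values (each roughly double the previous one).
balances = [1, 2, 4, 8, 16, 32, 64, 128, 256]
logged = [log_transform(v) for v in balances]

print(round(skewness(balances), 2), round(skewness(logged), 2))
print(logit_transform(0.5), logit_transform(0.0))
```

Percentages stored on a 0-100 scale would need dividing by 100 before the logit is applied. The clipping step is there because the logit is undefined at exactly 0 and 1.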

In [0]:
# 6. Perform feature transformation techniques to address skewness 

## 3. EDA: Examining Distributions and Feature Relationships

### Bivariate Analysis 

### Multicollinearity Handling 

## 4. Feature Selection & Engineering 

### Standardisation 

### Dimensionality Reduction 

## 5. Handling Dataset Imbalance 