This notebook aims to simulate a data scientist consuming data from a data pipeline, using PySpark. 

In this notebook, I will be executing the following steps:
  - Conduct EDA on data produced by the Medallion Data Pipeline 
  - Handle missing values 

Thereafter, feature engineering, feature selection and multicollinearity checks will be handled individually in their respective notebooks. I have split them up deliberately, since feature engineering should be targeted at each of LGD, PD and EAD. Producing a single master table for all 3 models would make model building tedious, computationally expensive and hard to interpret. 

# 1. Import Libraries

In [1]:
# Import function to start Spark
from init_spark import start_spark
spark = start_spark()

from pyspark.sql.functions import (
    col, when, count, desc, isnan, isnull, lit, length, trim, lower, upper, to_date, concat_ws,  regexp_extract, sum 
)

from pyspark.sql.types import (
    StructType, StructField, StringType, DoubleType, IntegerType, DateType, NumericType
)

Using Spark's default log4j profile: org/apache/spark/log4j2-defaults.properties
25/07/19 12:00:42 WARN Utils: Your hostname, Chengs-MacBook-Pro.local, resolves to a loopback address: 127.0.0.1; using 192.168.0.77 instead (on interface en0)
25/07/19 12:00:42 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
:: loading settings :: url = jar:file:/Users/lunlun/Downloads/Github/Credit-Risk-Modeling-PySpark/venv/lib/python3.11/site-packages/pyspark/jars/ivy-2.5.3.jar!/org/apache/ivy/core/settings/ivysettings.xml
Ivy Default Cache set to: /Users/lunlun/.ivy2.5.2/cache
The jars for the packages stored in: /Users/lunlun/.ivy2.5.2/jars
io.delta#delta-spark_2.13 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-a979af09-1b4b-4a76-81f7-e39c6a63da1d;1.0
	confs: [default]
	found io.delta#delta-spark_2.13;4.0.0 in central
	found io.delta#delta-storage;4.0.0 in central
	found org.antlr#antlr4-runtime;4.13.1 in central
:: resolution report :: 

4.0.0


# 2. EDA: Summary Statistics & Identify Data Issues

In this section, I will be acting as a data scientist (credit risk modeling) pulling data from the Gold Delta layer of the Medallion architecture. I will mainly be observing summary statistics, spotting and solving issues (e.g. missing values), and understanding the distribution of features.


In [2]:
df = spark.read.format("delta")\
    .load("../data/gold/medallion_cleaned_lc_data")
    
df.limit(10).toPandas()
    

25/07/19 12:00:46 WARN SparkStringUtils: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.
                                                                                

Unnamed: 0,id,member_id,loan_amnt,funded_amnt,funded_amnt_inv,term,int_rate,installment,grade,sub_grade,...,hardship_payoff_balance_amount,hardship_last_payment_amount,disbursement_method,debt_settlement_flag,debt_settlement_flag_date,settlement_status,settlement_date,settlement_amount,settlement_percentage,settlement_term
0,13337218,,25000.0,25000.0,25000.0,36,9.67,802.82,B,B1,...,,,Cash,N,,,,,,
1,13408650,,14675.0,14675.0,14675.0,36,9.67,471.26,B,B1,...,,,Cash,N,,,,,,
2,11687936,,6150.0,6150.0,6150.0,36,10.99,201.32,B,B2,...,,,Cash,N,,,,,,
3,13196289,,28800.0,28800.0,28800.0,60,14.16,672.52,C,C2,...,,,Cash,N,,,,,,
4,14698653,,24000.0,24000.0,24000.0,36,7.9,750.97,A,A4,...,,,Cash,N,,,,,,
5,13517682,,10000.0,10000.0,10000.0,36,12.49,334.49,B,B4,...,,,Cash,N,,,,,,
6,13457743,,8000.0,8000.0,8000.0,36,15.61,279.72,C,C5,...,,,Cash,N,,,,,,
7,13478658,,23875.0,23875.0,23875.0,36,15.31,831.27,C,C4,...,,,Cash,N,,,,,,
8,14619091,,4500.0,4500.0,4500.0,36,9.67,144.51,B,B1,...,,,Cash,N,,,,,,
9,14608817,,10000.0,10000.0,9975.0,36,13.65,340.08,C,C1,...,,,Cash,N,,,,,,


In [3]:
df.summary().toPandas()

25/07/19 12:01:31 WARN DAGScheduler: Broadcasting large task binary with size 2029.9 KiB
                                                                                

Unnamed: 0,summary,id,member_id,loan_amnt,funded_amnt,funded_amnt_inv,term,int_rate,installment,grade,...,hardship_loan_status,orig_projected_additional_accrued_interest,hardship_payoff_balance_amount,hardship_last_payment_amount,disbursement_method,debt_settlement_flag,settlement_status,settlement_amount,settlement_percentage,settlement_term
0,count,2221208.0,0.0,2221208.0,2221208.0,2221208.0,2221208.0,2221208.0,2221208.0,2221208,...,8969,6897.0,8969.0,8969.0,2221208,2221208,33178,33178.0,33178.0,33178.0
1,mean,80068723.37822752,,15031.747758426944,15026.667392247822,15011.568229119825,42.876539252514846,13.051038511501284,445.29413120701327,,...,,440.16573727707726,11443.788268480326,189.3172739435833,,,,5027.6609132557705,47.70444873108709,13.157785279402011
2,stddev,44994716.28885557,,9178.03857820431,9176.12810849421,9177.664163082043,10.851276559961232,4.816921036702081,266.6961572525336,,...,,363.06633872576265,7528.382889632834,197.75385252713272,,,,3685.28462379217,7.315936494507836,8.231626979758733
3,min,54734.0,,500.0,500.0,0.0,36.0,5.31,4.93,A,...,Current,1.92,55.73,0.01,Cash,N,ACTIVE,44.21,0.2,0.0
4,25%,44583572.0,,8000.0,8000.0,8000.0,36.0,9.49,251.59,,...,,170.52,5539.91,39.91,,,,2225.0,45.0,6.0
5,50%,84012975.0,,12800.0,12800.0,12800.0,36.0,12.62,377.51,,...,,343.83,9878.72,128.02,,,,4172.71,45.0,14.0
6,75%,122263712.0,,20000.0,20000.0,20000.0,60.0,15.88,591.88,,...,,595.65,15775.43,276.98,,,,6876.0,50.0,18.0
7,max,145647287.0,,40000.0,40000.0,40000.0,60.0,30.99,1719.83,G,...,Late (31-120 days),2680.89,40306.41,1407.86,DirectPay,Y,COMPLETE,33601.0,521.35,181.0


Based on the summary statistics of the Lending Club dataset, several data quality issues become apparent. 
- Some columns are unusable due to having a **large percentage of missing values** (see `../sandbox/string_issues` for reference)

- A number of features exhibit **missing values**, e.g. `emp_length`. These require an imputation strategy.

- Some categorical columns are unsuitable for our credit risk modeling project due to **high cardinality** (a large number of unique values), e.g. `emp_title`.

- There are also **redundant columns**, like `member_id`, which provide no value to our prediction of LGD, EAD and PD.

- There is also **post-loan information**, i.e. features whose values are generated after loan origination (application and approval), such as `total_pymnt` and `last_pymnt_d`. Such features would skew a PD model, so they must be dropped there. They will be removed in the individual modelling notebooks rather than here, since the LGD and EAD models require them. 

- There are features like `delinq_2yrs` which have **outliers** (maximum value far above the 75th percentile)
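
To make the "maximum far above the 75th percentile" check concrete, here is a minimal sketch using Tukey's 1.5 × IQR fence. This rule is my assumption for illustration, not necessarily the outlier treatment used later in the project. The quartiles are the `loan_amnt` values from the `df.summary()` output above, and `upper_fence` is a hypothetical helper name.

```python
def upper_fence(q1: float, q3: float, k: float = 1.5) -> float:
    """Return the upper Tukey fence: Q3 + k * (Q3 - Q1)."""
    return q3 + k * (q3 - q1)


# 25%, 75% and max of loan_amnt, copied from the df.summary() output above
q1, q3, max_value = 8000.0, 20000.0, 40000.0

fence = upper_fence(q1, q3)
has_high_outliers = max_value > fence

print(f"Upper fence: {fence}, max above fence: {has_high_outliers}")
```

The same check could be applied to any numeric column once its quartiles are known, e.g. from `df.summary()`.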


# 3. Handling Missing Values

Based on the issues identified above in the Lending Club dataset, I will now tackle each of them in order. 

### 3.1 Find Null Value % Per Column 
For this credit risk modeling project, I will be dropping columns with a high share of missing values. Many credit risk modelling projects on Kaggle and GitHub use 50%-65% missingness as the threshold to drop columns. Note that the cell below sets `missing_val_threshold = 30`, so in practice every column with >= 30% missing values is listed and dropped. 

Let's display the columns that meet the threshold first to inspect them 

In [4]:

# Get total number of rows
total_rows = df.count()

# Threshold (in %) at or above which a column is treated as high-missingness
missing_val_threshold = 30

# Count nulls for all columns in a single pass instead of one Spark job per column
# (`sum` is pyspark.sql.functions.sum, imported in Section 1)
null_counts = df.select(
    [sum(col(c).isNull().cast("int")).alias(c) for c in df.columns]
).first().asDict()

high_missingness_columns = []

for column in df.columns:
    null_pct = (null_counts[column] / total_rows) * 100
    if null_pct >= missing_val_threshold:
        print(f"{column}: {null_pct:.2f}% null")
        high_missingness_columns.append(column)

# Drop columns at or above the threshold (low predictive power upon inspection)
df = df.drop(*high_missingness_columns)
print("\n✅ Columns with high pct of missing values dropped ... \n")

# Inspect dimensions
num_rows = df.count()
num_cols = len(df.columns)

print(f"Updated Shape: ({num_rows}, {num_cols})")

member_id: 100.00% null
emp_length: 100.00% null
mths_since_last_delinq: 51.28% null
mths_since_last_record: 84.11% null
next_pymnt_d: 60.29% null
mths_since_last_major_derog: 74.31% null
annual_inc_joint: 94.79% null
dti_joint: 94.79% null
verification_status_joint: 95.00% null
open_acc_6m: 38.60% null
open_act_il: 38.60% null
open_il_12m: 38.60% null
open_il_24m: 38.60% null
mths_since_rcnt_il: 40.51% null
total_bal_il: 38.60% null
il_util: 47.51% null
open_rv_12m: 38.60% null
open_rv_24m: 38.60% null
max_bal_bc: 38.60% null
all_util: 38.60% null
inq_fi: 38.60% null
total_cu_tl: 38.60% null
inq_last_12m: 38.60% null
mths_since_recent_bc_dlq: 77.01% null
mths_since_recent_revol_delinq: 67.23% null
revol_bal_joint: 95.34% null
sec_app_fico_range_low: 95.34% null
sec_app_fico_range_high: 95.34% null
sec_app_earliest_cr_line: 95.34% null
sec_app_inq_last_6mths: 95.34% null
sec_app_mort_acc: 95.34% null
sec_app_open_acc: 95.34% null
sec_app_revol_util: 95.42% null
sec_app_open_act_il: 95.
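
The choice of threshold changes which columns survive. The sketch below is a plain-Python illustration, not part of the pipeline. It uses four of the null percentages printed above, and `columns_at_or_above` is a hypothetical helper name.

```python
def columns_at_or_above(null_pct: dict[str, float], threshold: float) -> list[str]:
    """Return the columns whose null percentage is >= threshold."""
    return [name for name, pct in null_pct.items() if pct >= threshold]


# Null percentages copied from the output above
null_pct = {
    "member_id": 100.00,
    "mths_since_last_delinq": 51.28,
    "il_util": 47.51,
    "open_acc_6m": 38.60,
}

dropped_at_30 = columns_at_or_above(null_pct, 30)
dropped_at_50 = columns_at_or_above(null_pct, 50)

print("Dropped at 30%:", dropped_at_30)
print("Dropped at 50%:", dropped_at_50)
```

With a 50% threshold, `il_util` and `open_acc_6m` would be kept and would then need imputation instead.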

### 3.2 Dropping Irrelevant/Redundant Columns 
This section removes **meaningless columns, categorical features with high cardinality, features with little predictive value (e.g. `member_id`, `emp_title`), and post-loan features**. Including such features may introduce multicollinearity and ultimately lower the predictive power of our credit models. 

The reasons for removing specific columns are as follows: 
- Columns with `inv`: largely duplicate their counterpart without the suffix, e.g. `total_pymnt_inv` is largely the same as `total_pymnt`

- `last_pymnt_d` and `last_credit_pull_d` (according to the Data Dictionary) have little predictive value even after feature engineering. They merely record the borrower's last payment date and the last date on which the credit report was pulled, which has little value in predicting PD, LGD or EAD. 

- `sub_grade` is more granular than `grade`, which may increase the risk of overfitting our PD, LGD, and EAD models. 

- High cardinality columns are computationally expensive to encode for machine learning models, which makes them undesirable in a big data space such as credit risk. 

- Hardship & settlement features: Lending Club borrowers only become eligible for hardship and settlement programmes after loan origination, not when they apply for the loan. Borrowers in financial hardship contact the lender and attempt to negotiate interest-free payments or lower principal payments. Such features should not be used to predict PD, LGD, and EAD, since my models should not know in advance whether a borrower will fall into hardship. 

- `disbursement_method` indicates how loan funds are delivered to the borrower. This has little relevance in predicting PD, LGD or EAD. 

- 🚩 Low variance features may slow down PCA (which aims to reduce dimensionality). They also add little value to the prediction of PD, EAD and LGD. (Dealt with after standardisation)


In [5]:
# Drop Derived/Meaningless Features 
derived_features = ["funded_amnt_inv", "sub_grade", "out_prncp_inv", "total_pymnt_inv", "last_pymnt_d", "last_credit_pull_d"] 
df = df.drop(*derived_features)
print(f"✅ Derived/Meaningless Features Dropped ...")


✅ Derived/Meaningless Features Dropped ...


In [6]:
# Drop High Cardinality Features 

# 1. Define Threshold 
high_cardinality_threshold = 50

# 2. Find Categorical Features (to identify high cardinality columns)


categorical_cols = [field.name for field in df.schema.fields if isinstance(field.dataType, StringType)]
print(categorical_cols)

# 3. Identify high-cardinality columns
high_card_cols = []

for col_name in categorical_cols:
    unique_count = df.select(col_name).distinct().count()

    if unique_count >= high_cardinality_threshold:
        print(f"\n{col_name} has {unique_count} unique values → dropping ... ")
        high_card_cols.append(col_name)

# 4. Drop high cardinality columns
df = df.drop(*high_card_cols)

print(f"\n✅ High Cardinality Features Dropped ...")

['grade', 'emp_title', 'home_ownership', 'verification_status', 'pymnt_plan', 'addr_state', 'initial_list_status', 'application_type', 'hardship_flag', 'disbursement_method', 'debt_settlement_flag']



emp_title has 477269 unique values → dropping ... 

addr_state has 51 unique values → dropping ... 

✅ High Cardinality Features Dropped ...
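
One detail worth keeping in mind for the cardinality check: `df.select(c).distinct().count()` counts NULL as its own distinct value, whereas `pyspark.sql.functions.countDistinct` ignores nulls. Near the threshold this can change the decision. The plain-Python sketch below mimics both behaviours on a made-up sample; `distinct_count` and `sample` are hypothetical names for illustration only.

```python
def distinct_count(values, include_null=True):
    """Count distinct values, optionally treating None (NULL) as a value."""
    seen = set(values)
    if not include_null:
        seen.discard(None)
    return len(seen)


# Hypothetical column values, with None standing in for NULL
sample = ["CA", "NY", None, "CA", None]

with_null = distinct_count(sample)                         # like distinct().count()
without_null = distinct_count(sample, include_null=False)  # like countDistinct

print(with_null, without_null)
```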


In [7]:
#! Drop columns with only 1 distinct value


def drop_constant_columns(df):
    """
    Removes all columns in the DataFrame that have only one distinct value.
    Returns a new DataFrame with those columns removed.
    """
    cols_to_drop = []

    for column_name in df.columns:
        if df.select(col(column_name)).distinct().count() <= 1:
            cols_to_drop.append(column_name)

    print(f"⚠️ Dropping constant columns: {cols_to_drop}")
    return df.drop(*cols_to_drop)

df = drop_constant_columns(df)

print("✅ Columns with only 1 distinct value dropped...")

⚠️ Dropping constant columns: ['policy_code']
✅ Columns with only 1 distinct value dropped...


In [8]:
# Drop hardship-related columns & miscellaneous columns
# ('policy_code' was already removed as a constant column above; drop() ignores names that are not in the schema)
hardship_columns = ["hardship_flag", "disbursement_method", "debt_settlement_flag", 'policy_code']

df = df.drop(*hardship_columns)
print("✅ Hardship & miscellaneous columns dropped ...")

✅ Hardship & miscellaneous columns dropped ...


🚩 I will need to remove post-loan origination features later on for PD prediction. Post-loan origination features such as `recoveries` are needed for LGD, EAD prediction, but not for PD prediction. 
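
A minimal sketch of how this per-model selection could look in the later notebooks. It is a plain-Python illustration under my own assumptions: `POST_ORIGINATION` only lists the two post-loan columns named in this notebook (the real list will be longer), and `features_for` is a hypothetical helper.

```python
# Post-origination columns named in this notebook; assumed incomplete
POST_ORIGINATION = {"total_pymnt", "recoveries"}


def features_for(model: str, columns: list[str]) -> list[str]:
    """Return the usable columns for a model: PD excludes post-origination features."""
    if model == "PD":
        return [c for c in columns if c not in POST_ORIGINATION]
    # LGD and EAD keep post-origination features
    return list(columns)


columns = ["loan_amnt", "int_rate", "total_pymnt", "recoveries"]

pd_features = features_for("PD", columns)
lgd_features = features_for("LGD", columns)

print("PD :", pd_features)
print("LGD:", lgd_features)
```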

### 3.3 Impute Missing Values (Categorical & Numerical)
After removing unnecessary columns with little predictive power, we will proceed to impute missing values. We will first identify the % of missing values per column. 

For numerical columns, missing values will be replaced with the median, since the median is robust to outliers and we haven't dealt with outliers yet. For categorical columns, missing values will be replaced with the mode (most frequent category). This approach is common and simple. More advanced imputation techniques exist (e.g. clustering-based imputation), but we shall not lose focus on learning about the credit risk modeling domain in this project. 
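
To illustrate why the median is the safer choice before outliers are handled, here is a small plain-Python sketch on made-up values (the income list and its extreme value are hypothetical, and `impute` is a hypothetical helper). It also shows mode imputation for a categorical column, with nulls excluded when computing the mode.

```python
from statistics import mean, median, mode


def impute(values, fill):
    """Replace None entries with the given fill value."""
    return [fill if v is None else v for v in values]


# Hypothetical incomes with one missing value and one extreme outlier
incomes = [35000.0, 50000.0, None, 56900.0, 9000000.0]
observed_incomes = [v for v in incomes if v is not None]

income_mean = mean(observed_incomes)      # pulled upwards by the outlier
income_median = median(observed_incomes)  # stays near typical values
imputed_incomes = impute(incomes, income_median)

# Hypothetical categorical column: mode is computed on non-null values only
grades = ["B", "C", None, "B"]
grade_mode = mode([g for g in grades if g is not None])
imputed_grades = impute(grades, grade_mode)

print(income_mean, income_median)
print(imputed_incomes)
print(imputed_grades)
```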

In [9]:
total_rows = df.count()

for column in df.columns: 
    null_count = df.filter(col(column).isNull()).count()
    if null_count > 0:
        print(f"{column}: {null_count} null values, {round(null_count/total_rows * 100,2)}% missing values.")

inq_last_6mths: 1 null values, 0.0% missing values.
collections_12_mths_ex_med: 54 null values, 0.0% missing values.
tot_coll_amt: 67219 null values, 3.03% missing values.
tot_cur_bal: 67219 null values, 3.03% missing values.
total_rev_hi_lim: 67219 null values, 3.03% missing values.
acc_open_past_24mths: 46992 null values, 2.12% missing values.
avg_cur_bal: 67219 null values, 3.03% missing values.
bc_open_to_buy: 69761 null values, 3.14% missing values.
bc_util: 70649 null values, 3.18% missing values.
chargeoff_within_12_mths: 54 null values, 0.0% missing values.
mo_sin_old_il_acct: 134567 null values, 6.06% missing values.
mo_sin_old_rev_tl_op: 67219 null values, 3.03% missing values.
mo_sin_rcnt_rev_tl_op: 67219 null values, 3.03% missing values.
mo_sin_rcnt_tl: 67219 null values, 3.03% missing values.
mort_acc: 46992 null values, 2.12% missing values.
mths_since_recent_bc: 68403 null values, 3.08% missing values.
mths_since_recent_inq: 287839 null values, 12.96% missing values.
nu

In [10]:
# Loop over each column
for feature in df.schema.fields:
    col_name = feature.name
    dtype = feature.dataType

    # Impute categorical columns with the mode (most frequent non-null category)
    if isinstance(dtype, StringType):
        mode_row = (
            df.filter(col(col_name).isNotNull())  # exclude nulls so the mode is never None
            .groupBy(col_name)
            .count()
            .orderBy(col("count").desc())
            .first()
        )

        # mode_row is None only if the column is entirely null
        if mode_row is not None:
            df = df.fillna({col_name: mode_row[0]})

    # Impute numerical columns with the (approximate) median
    elif isinstance(dtype, (IntegerType, DoubleType)):
        if df.filter(col(col_name).isNull()).count() > 0:
            # relativeError=0.01 trades exactness for speed on large data
            median_val = df.approxQuantile(col_name, [0.5], 0.01)[0]
            df = df.fillna({col_name: median_val})

print('✅ Categorical Column Missing Values Filled!')
print('✅ Numerical Column Missing Values Filled!')

✅ Categorical Column Missing Values Filled!
✅ Numerical Column Missing Values Filled!
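
A note on the `0.01` passed to `approxQuantile`: it is a relative error on the rank, so the returned "median" is a value whose exact rank lies within roughly 1% of the row count around the true median rank. The sketch below works through that arithmetic. `rank_bounds` is a hypothetical helper, and for simplicity it treats all 2,221,208 rows as non-null (nulls are ignored by `approxQuantile`, so the real count per column is slightly smaller).

```python
import math


def rank_bounds(n_rows: int, probability: float, relative_error: float) -> tuple[int, int]:
    """Rank interval guaranteed by an approximate quantile with the given relative error."""
    lower = math.floor((probability - relative_error) * n_rows)
    upper = math.ceil((probability + relative_error) * n_rows)
    return lower, upper


lower_rank, upper_rank = rank_bounds(2_221_208, 0.5, 0.01)

print(lower_rank, upper_rank)  # 1088391 1132817
```

Setting the relative error to `0` would give the exact median, at a higher computational cost.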


In [11]:
# Double check if there are any missing values before subsequent steps
total_rows = df.count()

output_arr = []
for column in df.columns: 
    null_count = df.filter(col(column).isNull()).count()
    if null_count > 0:
        output_arr.append(f"{column}: {null_count} null values, {round(null_count/total_rows * 100,2)}% missing values.")

if len(output_arr) == 0: 
    print('✅ No Missing Values Found!')
else:
    print(output_arr)

✅ No Missing Values Found!


In [12]:
df.limit(10).toPandas()

Unnamed: 0,id,loan_amnt,funded_amnt,term,int_rate,installment,grade,home_ownership,annual_inc,verification_status,...,num_tl_90g_dpd_24m,num_tl_op_past_12m,pct_tl_nvr_dlq,percent_bc_gt_75,pub_rec_bankruptcies,tax_liens,tot_hi_cred_lim,total_bal_ex_mort,total_bc_limit,total_il_high_credit_limit
0,13337218,25000.0,25000.0,36,9.67,802.82,B,MORTGAGE,110000.0,Verified,...,0.0,4.0,100.0,0.0,0.0,0.0,307625.0,49812.0,40600.0,32325.0
1,13408650,14675.0,14675.0,36,9.67,471.26,B,MORTGAGE,51000.0,Source Verified,...,0.0,2.0,94.7,0.0,0.0,0.0,76535.0,24729.0,54500.0,19435.0
2,11687936,6150.0,6150.0,36,10.99,201.32,B,MORTGAGE,58000.0,Verified,...,0.0,1.0,100.0,33.3,0.0,0.0,372565.0,42396.0,28000.0,36863.0
3,13196289,28800.0,28800.0,60,14.16,672.52,C,MORTGAGE,79000.0,Verified,...,0.0,2.0,95.3,0.0,0.0,1.0,297776.0,16627.0,20300.0,26467.0
4,14698653,24000.0,24000.0,36,7.9,750.97,A,MORTGAGE,93000.0,Verified,...,0.0,0.0,100.0,25.0,0.0,0.0,305424.0,38240.0,85000.0,0.0
5,13517682,10000.0,10000.0,36,12.49,334.49,B,RENT,35000.0,Verified,...,0.0,1.0,96.3,28.6,0.0,0.0,20600.0,8868.0,9300.0,0.0
6,13457743,8000.0,8000.0,36,15.61,279.72,C,RENT,23964.0,Verified,...,0.0,3.0,92.3,60.0,0.0,0.0,8300.0,3623.0,5300.0,0.0
7,13478658,23875.0,23875.0,36,15.31,831.27,C,MORTGAGE,56900.0,Verified,...,0.0,1.0,100.0,20.0,0.0,0.0,135700.0,15090.0,20800.0,0.0
8,14619091,4500.0,4500.0,36,9.67,144.51,B,MORTGAGE,50000.0,Verified,...,0.0,4.0,95.3,0.0,0.0,0.0,147702.0,25553.0,11900.0,29402.0
9,14608817,10000.0,10000.0,36,13.65,340.08,C,OWN,36275.0,Verified,...,1.0,0.0,92.0,50.0,1.0,0.0,18462.0,11001.0,3800.0,3362.0


In [13]:
# `df` now represents the DataFrame after finishing Section 3 (Handling Missing Values)

df.write.format("delta").mode("overwrite").option('mergeSchema', 'true').save("../data/gold/medallion_cleaned_lc_data_b4_model")


                                                                                

In [14]:
# Check if Gold Delta is accessible for subsequent model building 
gold_table = spark.read.format("delta")\
    .load("../data/gold/medallion_cleaned_lc_data_b4_model")
    
gold_table.limit(10).toPandas()

Unnamed: 0,id,loan_amnt,funded_amnt,term,int_rate,installment,grade,home_ownership,annual_inc,verification_status,...,num_tl_90g_dpd_24m,num_tl_op_past_12m,pct_tl_nvr_dlq,percent_bc_gt_75,pub_rec_bankruptcies,tax_liens,tot_hi_cred_lim,total_bal_ex_mort,total_bc_limit,total_il_high_credit_limit
0,104015209,12000.0,12000.0,60,13.99,279.16,C,OWN,38000.0,Verified,...,0.0,3.0,100.0,50.0,1.0,0.0,39777.0,24900.0,12400.0,19777.0
1,106185964,6025.0,6025.0,36,12.74,202.26,C,RENT,33384.0,Source Verified,...,0.0,4.0,100.0,80.0,0.0,0.0,35525.0,19883.0,11500.0,12125.0
2,107078349,7200.0,7200.0,36,19.99,267.55,D,RENT,20000.0,Verified,...,0.0,1.0,100.0,60.0,0.0,0.0,22900.0,15047.0,19800.0,0.0
3,106792441,6025.0,6025.0,36,25.49,241.12,E,RENT,32000.0,Source Verified,...,0.0,3.0,50.0,66.7,0.0,0.0,45974.0,38462.0,2500.0,41074.0
4,105825055,7500.0,7500.0,36,13.99,256.3,C,MORTGAGE,40000.0,Source Verified,...,0.0,3.0,94.1,75.0,0.0,0.0,199247.0,16070.0,12000.0,13544.0
5,104612483,20000.0,20000.0,60,14.99,475.7,C,OWN,53000.0,Verified,...,0.0,2.0,100.0,50.0,0.0,0.0,78639.0,19288.0,17500.0,8639.0
6,104899462,40000.0,40000.0,36,12.74,1342.76,C,MORTGAGE,124000.0,Verified,...,0.0,7.0,100.0,7.7,0.0,0.0,421087.0,138941.0,52700.0,124637.0
7,105822568,4500.0,4500.0,36,12.74,151.06,C,RENT,32174.76,Verified,...,0.0,0.0,100.0,50.0,0.0,0.0,17483.0,14480.0,4900.0,12583.0
8,102433536,5000.0,5000.0,36,13.99,170.87,C,RENT,20000.0,Verified,...,0.0,2.0,100.0,100.0,1.0,0.0,18600.0,13844.0,9200.0,4000.0
9,106967221,8000.0,8000.0,36,7.24,247.9,A,MORTGAGE,55000.0,Not Verified,...,0.0,1.0,94.7,22.2,0.0,0.0,57300.0,21890.0,17900.0,8000.0


In [None]:
print("✅ All missing values filled and saved to Gold Delta Table!")
print("✅ Ready for PD, LGD, and EAD Modeling!")


✅ All missing values filled and saved to Gold Delta Table!
✅ Ready for PD, LGD, and EAD Modeling!

