This notebook aims to utilise sklearn to build the final machine learning model for our PD Model. 
To build this final model, we will need to conduct out of time train-test split, select features we have selected in feature engineering, conduct WoE binning for our features, then train the final Logistic Regression Model.

# 0. Import Libraries


In [1]:
# === Standard libraries ===
import os


# === WandB Logging  ===
import wandb

wandb.login(key=os.getenv("WANDB_API_KEY"))


# === Spark Session & Functions ===
from init_spark import start_spark

spark = start_spark()
from pyspark.sql.functions import (
    col,
    when,
    count,
    desc,
    isnan,
    isnull,
    lit,
    length,
    trim,
    lower,
    upper,
    to_date,
    concat_ws,
    regexp_extract,
    mean,
)
from pyspark.sql.types import (
    StructType,
    StructField,
    StringType,
    DoubleType,
    IntegerType,
    DateType,
    NumericType,
    FloatType,
    LongType,
)


# === Pandas Dataframe & WoE Binning ===
import pandas as pd
from tabulate import tabulate
from optbinning import OptimalBinning
import numpy as np

# == Visualisation
import matplotlib.pyplot as plt
import seaborn as sns
import seaborn as sns
import matplotlib.pyplot as plt

# === Machine Learning ===
from sklearn.linear_model import LogisticRegression as SkLogisticRegression
from sklearn.metrics import (
    roc_auc_score,
    f1_score,
    precision_score,
    recall_score,
    accuracy_score,
    classification_report,
    confusion_matrix,
    precision_recall_curve,
    ConfusionMatrixDisplay,
)

# == Optbinning ==
from optbinning import OptimalPWBinning


# === Load Environment Variables ===
from dotenv import load_dotenv

load_dotenv()

# == Global Functions ==
from functions import *
import warnings
from optbinning import OptimalBinningSketch


warnings.filterwarnings("ignore", category=FutureWarning)

[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /Users/lunlun/.netrc
[34m[1mwandb[0m: Currently logged in as: [33mwlunlun1212[0m ([33mwlunlun1212-singapore-management-university[0m) to [32mhttps://api.wandb.ai[0m. Use [1m`wandb login --relogin`[0m to force relogin
Using Spark's default log4j profile: org/apache/spark/log4j2-defaults.properties
25/08/11 22:13:13 WARN Utils: Your hostname, Chengs-MacBook-Pro.local, resolves to a loopback address: 127.0.0.1; using 192.168.0.77 instead (on interface en0)
25/08/11 22:13:13 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
:: loading settings :: url = jar:file:/Users/lunlun/Downloads/Github/Credit-Risk-Modeling-PySpark/venv/lib/python3.11/site-packages/pyspark/jars/ivy-2.5.3.jar!/org/apache/ivy/core/settings/ivysettings.xml
Ivy Default Cache set to: /Users/lunlun/.ivy2.5.2/cache
The jars for the packages stored in: /Users/lunlun/.ivy2.5.2/jars
io.delta#delta-spark_2.13 added as a dependen

4.0.0


[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /Users/lunlun/.netrc


In [2]:
# == Constants ==

TARGET_COL = "default_status"
SAMPLE_FRAC = 0.1
SEED = 42
NOTEBOOK_RUN_NAME = "PD Model (Spark)"

In [3]:
# == Functions ==

In [4]:
# == Remove all existing runs in this notebook (everytime i run) ==

api = wandb.Api()
for run in api.runs(
    f"wlunlun1212-singapore-management-university/Credit Risk Modeling"
):
    if run.group == NOTEBOOK_RUN_NAME:
        run.delete()

In [5]:
df = spark.read.format("delta").load("../data/gold/feature_selection_next")

df.limit(10).toPandas()

25/08/11 22:13:20 WARN SparkStringUtils: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.
                                                                                

Unnamed: 0,id,loan_amnt,funded_amnt,term,int_rate,installment,grade,emp_length,home_ownership,annual_inc,...,pct_tl_nvr_dlq,percent_bc_gt_75,pub_rec_bankruptcies,tax_liens,tot_hi_cred_lim,total_bal_ex_mort,total_bc_limit,total_il_high_credit_limit,credit_history_years,fico_score
0,7369532,33425.0,33425.0,60,15.1,796.94,C,4,RENT,75000.0,...,100.0,0.0,0.0,0.0,11649.0,5657.0,5000.0,0.0,10,742.0
1,6965099,7000.0,7000.0,36,18.25,253.95,D,4,OWN,50000.0,...,86.0,0.0,0.0,0.0,40127.0,4598.0,8400.0,27627.0,21,672.0
2,6975765,20000.0,20000.0,36,6.03,608.72,A,10,MORTGAGE,200000.0,...,98.0,0.0,0.0,0.0,527579.0,25040.0,51900.0,43306.0,40,782.0
3,6919540,2000.0,2000.0,36,9.71,64.27,B,10,MORTGAGE,40000.0,...,96.0,0.0,0.0,0.0,63733.0,9832.0,9600.0,19683.0,12,742.0
4,7091151,6075.0,6075.0,36,12.35,202.8,B,4,MORTGAGE,50000.0,...,88.0,0.0,0.0,0.0,168962.0,27368.0,4500.0,35241.0,17,752.0
5,6835512,8800.0,8800.0,36,15.88,308.87,C,8,RENT,60000.0,...,100.0,100.0,1.0,0.0,39203.0,19942.0,6000.0,31703.0,18,662.0
6,7089001,6500.0,6500.0,36,18.85,237.78,D,3,RENT,44000.0,...,98.0,100.0,1.0,0.0,79552.0,75059.0,2300.0,76552.0,10,662.0
7,6819237,15850.0,15850.0,60,13.68,366.18,C,3,MORTGAGE,64000.0,...,93.0,100.0,0.0,0.0,386747.0,18234.0,10000.0,13465.0,16,712.0
8,7369220,11300.0,11300.0,36,11.99,375.27,B,10,RENT,34500.0,...,100.0,0.0,1.0,0.0,19898.0,12306.0,9300.0,10598.0,10,672.0
9,7048791,35000.0,35000.0,60,15.22,836.7,C,10,MORTGAGE,84308.0,...,98.0,0.0,0.0,0.0,219294.0,190674.0,20500.0,192694.0,26,777.0


# 1. Feature Engineering


['acc_open_past_24mths',
 'annual_inc',
 'avg_cur_bal',
 'bc_open_to_buy',
 'bc_util',
 'dti',
 'fico_score',
 'inq_last_6mths',
 'int_rate',
 'mo_sin_rcnt_rev_tl_op',
 'mo_sin_rcnt_tl',
 'mort_acc',
 'mths_since_recent_bc',
 'mths_since_recent_inq',
 'num_actv_rev_tl',
 'num_tl_op_past_12m',
 'percent_bc_gt_75',
 'revol_util',
 'term',
 'total_bc_limit',
 'verification_status',
 'loan_amnt/annual_inc',
 'loan_amnt/tot_hi_cred_lim',
 'id',
 'default_status',
 'issue_d']

# 2. Out of Time Train-Test Split


In [7]:
# 1) Convert to numeric timestamp
df_ts = df.withColumn("issue_ts", unix_timestamp(col("issue_d")))

# 2) Compute the 7th percentile of that timestamp
quantiles = df_ts.approxQuantile("issue_ts", [0.7], 0.01)
cut_ts = quantiles[0]  # e.g. 1672531200

# 3) Convert back to a human date
cut_date = df_ts.select(
    from_unixtime(lit(cut_ts), "yyyy-MM-dd").alias("cut_date")
).first()["cut_date"]

print(f"Splitting at ≈ {cut_date} ... ")

CUT_OFF_DATE = cut_date

Splitting at ≈ 2016-04-01 ... 


In [8]:
# == Out of Time Split ==
train_sdf, test_sdf = oot_train_test_split(df, CUT_OFF_DATE)

print("Train Dataset Proportion:")
train_sdf.groupBy(col("default_status")).count().show()

print("Test Dataset Proportion:")
test_sdf.groupBy(col("default_status")).count().show()

Train Dataset Proportion:
+--------------+------+
|default_status| count|
+--------------+------+
|             1|155426|
|             0|755923|
+--------------+------+

Test Dataset Proportion:
+--------------+------+
|default_status| count|
+--------------+------+
|             1| 80592|
|             0|311144|
+--------------+------+



# 3. WoE Binning


In [9]:
from optbinning import OptimalBinning
import pandas as pd


def woe_bin_transform_train(
    df, non_mono_cols, target_col="default_status", monotonic="auto"
):
    """
    Train OptBinning models for each feature in df (Pandas DataFrame)
    and replace original columns with WOE-transformed versions.

    Parameters:
    -----------
    df : pandas.DataFrame
        Training data (must include target_col)
    non_mono_cols : list
        List of features that are non-monotonic
    target_col : str
        Name of the target column (binary)
    monotonic : str
        Monotonicity constraint for OptBinning ("auto", "ascending", etc.)

    Returns:
    --------
    transformed_df : pandas.DataFrame
        DataFrame with WOE-transformed features
    optb_dict : dict
        Dictionary mapping column names to fitted OptimalBinning objects
    """

    excluded = ["id", "issue_d", target_col, "earliest_cr_line"]
    features = [col for col in df.columns if col not in excluded]

    optb_dict = {}

    for col in features:
        # Set monotonic trend
        trend = "auto_asc_desc" if col in non_mono_cols else monotonic

        # Detect dtype
        if df[col].dtype == "object":
            dtype = "categorical"
        elif pd.api.types.is_numeric_dtype(df[col]):
            dtype = "numerical"
        else:
            continue  # skip unsupported types

        # Train binning model
        optb = OptimalBinning(
            name=col,
            dtype=dtype,
            monotonic_trend=trend,
            solver="cp",
            min_bin_size=0.05,
            max_n_bins=5,
        )

        try:
            optb.fit(df[col], df[target_col])
            df[col + "_woe"] = optb.transform(df[col], metric="woe")
            df = df.drop(columns=[col])
            optb_dict[col] = optb
        except Exception as e:
            print(f"❌ Error fitting {col}: {e}")

    return df, optb_dict


def apply_woe_transform(df, optb_dict):
    """
    Apply pre-trained OptimalBinning models to transform features into WOE.
    """
    for col, optb in optb_dict.items():
        df[col + "_woe"] = optb.transform(df[col], metric="woe")
        df = df.drop(columns=[col])
    return df

In [10]:
# Convert Spark → Pandas
train_pdf = train_sdf.toPandas()

# Train WOE binning
train_pdf_woe, optb_dict = woe_bin_transform_train(train_pdf, [])

# Convert back to Spark
train_sdf_woe = spark.createDataFrame(train_pdf_woe)

# == WoE transform test_sdf
test_pdf = test_sdf.toPandas()
test_pdf = apply_woe_transform(test_pdf, optb_dict)
test_sdf_woe = spark.createDataFrame(test_pdf)

In [11]:
train_sdf_woe.limit(5).toPandas()

25/08/11 22:14:36 WARN TaskSetManager: Stage 41 contains a task of very large size (17614 KiB). The maximum recommended task size is 1000 KiB.


Unnamed: 0,id,default_status,issue_d,acc_open_past_24mths_woe,annual_inc_woe,avg_cur_bal_woe,bc_open_to_buy_woe,bc_util_woe,dti_woe,fico_score_woe,...,mths_since_recent_inq_woe,num_actv_rev_tl_woe,num_tl_op_past_12m_woe,percent_bc_gt_75_woe,revol_util_woe,term_woe,total_bc_limit_woe,verification_status_woe,loan_amnt/annual_inc_woe,loan_amnt/tot_hi_cred_lim_woe
0,87023,0,2007-06-01,0.115676,-0.150403,0.016607,0.077862,0.098392,0.220962,-0.266033,...,0.121147,0.089956,0.043271,0.097973,0.022701,0.339628,0.059528,0.368549,-0.533768,0.091375
1,98276,0,2007-07-01,0.115676,-0.150403,0.016607,0.077862,0.098392,0.430396,0.876276,...,0.121147,0.089956,0.043271,0.097973,0.329043,0.339628,0.059528,0.368549,-0.533768,0.346181
2,92402,0,2007-07-01,0.115676,0.02646,0.016607,0.077862,0.098392,0.220962,0.876276,...,0.121147,0.089956,0.043271,0.097973,0.146274,0.339628,0.059528,0.368549,0.490607,0.346181
3,109355,0,2007-07-01,0.115676,-0.150403,0.016607,0.077862,0.098392,0.430396,-0.266033,...,0.121147,0.089956,0.043271,0.097973,-0.147876,0.339628,0.059528,0.368549,0.490607,0.599597
4,92187,0,2007-07-01,0.115676,0.195359,0.016607,0.077862,0.098392,0.220962,0.876276,...,0.121147,0.089956,0.043271,0.097973,0.329043,0.339628,0.059528,0.368549,0.490607,0.599597


In [12]:
test_sdf_woe.limit(5).toPandas()

25/08/11 22:14:37 WARN TaskSetManager: Stage 42 contains a task of very large size (7492 KiB). The maximum recommended task size is 1000 KiB.


Unnamed: 0,id,default_status,issue_d,acc_open_past_24mths_woe,annual_inc_woe,avg_cur_bal_woe,bc_open_to_buy_woe,bc_util_woe,dti_woe,fico_score_woe,...,mths_since_recent_inq_woe,num_actv_rev_tl_woe,num_tl_op_past_12m_woe,percent_bc_gt_75_woe,revol_util_woe,term_woe,total_bc_limit_woe,verification_status_woe,loan_amnt/annual_inc_woe,loan_amnt/tot_hi_cred_lim_woe
0,75971272,0,2016-04-01,-0.128031,-0.150403,-0.162449,-0.129179,0.256152,0.220962,0.436392,...,0.121147,0.205306,0.370959,0.280331,0.329043,0.339628,-0.205928,-0.202021,0.260036,-0.092252
1,76372047,0,2016-04-01,0.115676,0.336114,0.49512,0.077862,0.256152,0.220962,0.876276,...,0.121147,0.205306,0.177778,0.280331,0.329043,0.339628,-0.121859,-0.073536,0.260036,-0.092252
2,75699430,0,2016-04-01,0.115676,-0.150403,0.21324,-0.129179,0.256152,-0.016305,-0.04725,...,0.121147,0.205306,0.370959,0.280331,0.329043,0.339628,-0.205928,-0.202021,0.015825,0.346181
3,75924220,0,2016-04-01,-0.128031,-0.086911,0.016607,0.077862,0.098392,-0.548363,0.173325,...,-0.2577,0.205306,-0.233952,0.097973,0.329043,0.339628,-0.205928,-0.073536,0.490607,0.346181
4,77239965,0,2016-04-01,0.384432,-0.086911,0.49512,0.077862,0.098392,-0.016305,0.876276,...,0.121147,0.205306,0.177778,0.097973,0.146274,0.339628,-0.205928,-0.202021,0.260036,0.346181


In [17]:
train_sdf_woe.count()

25/08/11 22:23:11 WARN TaskSetManager: Stage 153 contains a task of very large size (17614 KiB). The maximum recommended task size is 1000 KiB.
                                                                                

911349

# 4. Model Training


In [21]:
from pyspark.ml.feature import VectorAssembler, StandardScaler
from pyspark.ml.functions import vector_to_array

from pyspark.ml.classification import LogisticRegression
from pyspark.ml import Pipeline
from pyspark.sql.types import NumericType
import numpy as np
import pandas as pd
import wandb
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.metrics import (
    precision_recall_curve,
    confusion_matrix,
    accuracy_score,
    precision_score,
    recall_score,
    f1_score,
    roc_auc_score,
    classification_report,
)


def train_eval_logistic_with_threshold_spark(
    train_sdf,
    test_sdf,
    run_name,
    run_group,
    model_type="Logistic Regression",
    weight_col=None,
    labels=["Non-Default", "Default"],
):
    """
    Spark version of train_eval_logistic_with_threshold.
    Converts Spark prob output to Pandas for threshold tuning.
    """

    # === 1. Start W&B run ===
    wandb.init(
        entity="wlunlun1212-singapore-management-university",
        project="Credit Risk Modeling",
        name=run_name,
        group=run_group,
    )

    # === 2. Select numeric features ===
    exclude = {"id", "issue_d", "earliest_cr_line", "default_status"}
    feature_cols = [
        f.name
        for f in train_sdf.schema.fields
        if (f.name not in exclude)
        and (isinstance(f.dataType, NumericType) or isinstance(f.dataType, VectorUDT))
    ]

    # === 3. Assemble + Standardize ===
    assembler = VectorAssembler(inputCols=feature_cols, outputCol="features_vec")


    # === 4. Logistic Regression ===
    lr = LogisticRegression(
        featuresCol="features_vec",
        labelCol="default_status",
        weightCol=weight_col if weight_col else None,
        maxIter=1000,
        regParam=1.0,  # ~ C=1 in sklearn
        elasticNetParam=0.0,
    )

    pipeline = Pipeline(stages=[assembler, lr])
    model = pipeline.fit(train_sdf)

    # === 5. Predict on test ===
    preds_sdf = model.transform(test_sdf)

    preds_sdf = preds_sdf.withColumn(
        "prob_default", vector_to_array(F.col("probability"))[1]
    )

    preds_pdf = preds_sdf.select("default_status", "prob_default").toPandas()
    y_true = preds_pdf["default_status"].values
    y_proba = preds_pdf["prob_default"].values

    # === 6. Threshold tuning via PR curve ===
    precision, recall, thresholds = precision_recall_curve(y_true, y_proba)
    f1_scores = 2 * precision[1:] * recall[1:] / (precision[1:] + recall[1:] + 1e-12)

    best_idx = int(np.argmax(f1_scores))
    best_thresh = float(thresholds[best_idx])
    best_f1 = float(f1_scores[best_idx])

    print(f"\n✅ Best F1 Score = {best_f1:.4f} at threshold = {best_thresh:.2f}")

    # === 7. Final prediction using optimal threshold ===
    y_pred_opt = (y_proba >= best_thresh).astype(int)

    # === 8. Confusion Matrix ===
    cm = confusion_matrix(y_true, y_pred_opt)
    plt.figure(figsize=(6, 4))
    sns.heatmap(
        cm, annot=True, fmt="d", cmap="Blues", xticklabels=labels, yticklabels=labels
    )
    plt.xlabel("Predicted Label")
    plt.ylabel("True Label")
    plt.title("Optimized Confusion Matrix (F1)")
    plt.tight_layout()
    cm_fig = plt.gcf()
    plt.close()

    # === 9. F1 vs Threshold plot ===
    plt.figure(figsize=(7, 4))
    plt.plot(thresholds, f1_scores, label="F1 Score")
    plt.axvline(
        best_thresh,
        color="red",
        linestyle="--",
        label=f"Best Threshold: {best_thresh:.2f}",
    )
    plt.xlabel("Threshold")
    plt.ylabel("F1 Score")
    plt.title("F1 Score vs. Threshold")
    plt.legend()
    f1_fig = plt.gcf()
    plt.close()

    # === 10. Metrics ===
    accuracy = accuracy_score(y_true, y_pred_opt)
    precision_final = precision_score(y_true, y_pred_opt)
    recall_final = recall_score(y_true, y_pred_opt)
    f1_final = f1_score(y_true, y_pred_opt)
    auc = roc_auc_score(y_true, y_proba)
    gini = 2 * auc - 1

    print("\n📄 Classification Report (Optimized Threshold):")
    print(classification_report(y_true, y_pred_opt, target_names=labels, digits=4))

    report_dict = classification_report(
        y_true, y_pred_opt, target_names=labels, output_dict=True
    )
    report_df = pd.DataFrame(report_dict).T.round(4)
    report_md = report_df.to_markdown()

    # === 11. Log to WandB ===
    wandb.log(
        {
            "Model Type": model_type,
            "Best Threshold": best_thresh,
            "Gini": gini,
            "Accuracy": accuracy,
            "Precision": precision_final,
            "Recall": recall_final,
            "F1 Score": f1_final,
            "Confusion Matrix (sns)": wandb.Image(cm_fig),
            "F1 vs Threshold": wandb.Image(f1_fig),
            "Classification Report (Markdown)": wandb.Html(f"<pre>{report_md}</pre>"),
        }
    )
    wandb.finish()

    return {
        "best_threshold": best_thresh,
        "best_f1": best_f1,
        "accuracy": accuracy,
        "precision": precision_final,
        "recall": recall_final,
        "f1": f1_final,
        "auc": auc,
        "gini": gini,
    }

In [22]:
def add_class_weightage_cols(train_df) -> DataFrame:
    """
    Implement same logic as 'balanced' class in sklearn (give more importance to rare class during training)

    Adds class_weight column to TRAIN_DATASET
    """

    # Count examples in each class
    major_count = train_df.filter(train_df.default_status == 0).count()
    minor_count = train_df.filter(train_df.default_status == 1).count()
    total_count = train_df.count()

    # Calculate weights (inverse frequency)
    weight_for_0 = total_count / (2 * major_count)
    weight_for_1 = total_count / (2 * minor_count)

    # Add a column for sample weights
    train_df = train_df.withColumn(
        "class_weight_col",
        F.when(train_df.default_status == 0, weight_for_0).otherwise(weight_for_1),
    )

    return train_df

train_sdf_woe = add_class_weightage_cols(train_sdf_woe)
test_sdf_woe  = test_sdf_woe.withColumn("class_weight_col", lit(1.0))


train_eval_logistic_with_threshold_spark(train_sdf_woe, test_sdf_woe, 'log_reg_spark_pd_final_model', NOTEBOOK_RUN_NAME, 'Logistic Regression',
                                         'class_weight_col')

25/08/11 22:26:46 WARN TaskSetManager: Stage 168 contains a task of very large size (17614 KiB). The maximum recommended task size is 1000 KiB.
25/08/11 22:26:48 WARN TaskSetManager: Stage 171 contains a task of very large size (17614 KiB). The maximum recommended task size is 1000 KiB.
25/08/11 22:26:48 WARN TaskSetManager: Stage 174 contains a task of very large size (17614 KiB). The maximum recommended task size is 1000 KiB.


25/08/11 22:26:51 WARN TaskSetManager: Stage 177 contains a task of very large size (17614 KiB). The maximum recommended task size is 1000 KiB.
25/08/11 22:26:52 WARN TaskSetManager: Stage 179 contains a task of very large size (17614 KiB). The maximum recommended task size is 1000 KiB.
25/08/11 22:26:53 WARN TaskSetManager: Stage 181 contains a task of very large size (17614 KiB). The maximum recommended task size is 1000 KiB.
25/08/11 22:26:54 WARN TaskSetManager: Stage 183 contains a task of very large size (17614 KiB). The maximum recommended task size is 1000 KiB.
25/08/11 22:26:54 WARN TaskSetManager: Stage 185 contains a task of very large size (17614 KiB). The maximum recommended task size is 1000 KiB.
25/08/11 22:26:54 WARN TaskSetManager: Stage 187 contains a task of very large size (17614 KiB). The maximum recommended task size is 1000 KiB.
25/08/11 22:26:54 WARN TaskSetManager: Stage 189 contains a task of very large size (17614 KiB). The maximum recommended task size is 10


✅ Best F1 Score = 0.4205 at threshold = 0.43

📄 Classification Report (Optimized Threshold):
              precision    recall  f1-score   support

 Non-Default     0.8736    0.6265    0.7297    311144
     Default     0.3107    0.6502    0.4205     80592

    accuracy                         0.6313    391736
   macro avg     0.5922    0.6383    0.5751    391736
weighted avg     0.7578    0.6313    0.6661    391736



0,1
Accuracy,▁
Best Threshold,▁
F1 Score,▁
Gini,▁
Precision,▁
Recall,▁

0,1
Accuracy,0.63134
Best Threshold,0.4308
F1 Score,0.42051
Gini,0.38358
Model Type,Logistic Regression
Precision,0.31074
Recall,0.65018


{'best_threshold': 0.43080072760846655,
 'best_f1': 0.42051240896370984,
 'accuracy': 0.6313384524271448,
 'precision': 0.31074461523863744,
 'recall': 0.6501761961485011,
 'f1': 0.42051072162300973,
 'auc': 0.6917895904207318,
 'gini': 0.3835791808414637}