# Model Evaluation & Threshold Selection

## Objective
This notebook evaluates model performance using business-relevant metrics and selects an operating threshold that balances default risk against loan approval volume.

In [1]:
import pandas as pd
import numpy as np

from sklearn.metrics import (
    confusion_matrix,
    classification_report,
    roc_auc_score,
    precision_recall_curve
)

In [2]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

df = pd.read_csv("../data/processed/model_data.csv")

X = df.drop(columns=["default"])
y = df["default"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.25,
    random_state=42,
    stratify=y
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

model = LogisticRegression(
    max_iter=1000,
    class_weight="balanced"
)

model.fit(X_train_scaled, y_train)

y_test_proba = model.predict_proba(X_test_scaled)[:, 1]

In [3]:
roc_auc_score(y_test, y_test_proba)

np.float64(0.7002167729099027)

In [4]:
y_pred_05 = (y_test_proba >= 0.5).astype(int)

confusion_matrix(y_test, y_pred_05)

array([[179845,  89343],
       [ 25543,  41597]])

In [5]:
print(classification_report(y_test, y_pred_05))

              precision    recall  f1-score   support

           0       0.88      0.67      0.76    269188
           1       0.32      0.62      0.42     67140

    accuracy                           0.66    336328
   macro avg       0.60      0.64      0.59    336328
weighted avg       0.76      0.66      0.69    336328



In [6]:
y_pred_03 = (y_test_proba >= 0.3).astype(int)

confusion_matrix(y_test, y_pred_03)

array([[ 55374, 213814],
       [  3489,  63651]])

In [7]:
print(classification_report(y_test, y_pred_03))

              precision    recall  f1-score   support

           0       0.94      0.21      0.34    269188
           1       0.23      0.95      0.37     67140

    accuracy                           0.35    336328
   macro avg       0.59      0.58      0.35    336328
weighted avg       0.80      0.35      0.34    336328



In [8]:
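def evaluate_threshold(threshold):
    """Summarize default-class metrics on the test set at a given threshold.

    Relies on the notebook-level variables y_test and y_test_proba.
    """
    # Predict default (1) when the predicted probability reaches the threshold
    y_pred = (y_test_proba >= threshold).astype(int)
    # Rows are true classes, columns are predicted classes: [[TN, FP], [FN, TP]]
    cm = confusion_matrix(y_test, y_pred)
    report = classification_report(y_test, y_pred, output_dict=True)

    return {
        "threshold": threshold,
        "recall_default": report["1"]["recall"],
        "precision_default": report["1"]["precision"],
        "false_negatives": cm[1, 0],  # defaulters predicted as non-default
        "false_positives": cm[0, 1]   # non-defaulters predicted as default
    }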

In [9]:
thresholds = [0.2, 0.3, 0.4, 0.5]

results = pd.DataFrame([evaluate_threshold(t) for t in thresholds])
results

   threshold  recall_default  precision_default  false_negatives  false_positives
0        0.2        0.993059           0.207145              466           255197
1        0.3        0.948034           0.229402             3489           213814
2        0.4        0.826393           0.266554            11656           152669
3        0.5        0.619556           0.317680            25543            89343


In [10]:
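# Full precision-recall curve over every distinct predicted score.
# precision and recall each have one more entry than thresh.
# Note: the grid-based selection in the next cell does not use these arrays.
precision, recall, thresh = precision_recall_curve(y_test, y_test_proba)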

In [11]:
# business constraint
MIN_RECALL = 0.94

# filter thresholds that meet recall requirement
eligible_thresholds = results[results["recall_default"] >= MIN_RECALL]

# choose threshold with highest precision among eligible ones
best_threshold = eligible_thresholds.sort_values(
    by="precision_default", ascending=False
).iloc[0]

best_threshold

threshold                 0.300000
recall_default            0.948034
precision_default         0.229402
false_negatives        3489.000000
false_positives      213814.000000
Name: 1, dtype: float64

In [12]:
FINAL_THRESHOLD = best_threshold["threshold"]
FINAL_THRESHOLD

np.float64(0.3)

In [13]:
y_final_pred = (y_test_proba >= FINAL_THRESHOLD).astype(int)

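# Print both results: a bare expression is only displayed when it is the last line of a cell
print(confusion_matrix(y_test, y_final_pred))
print(classification_report(y_test, y_final_pred))

[[ 55374 213814]
 [  3489  63651]]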
              precision    recall  f1-score   support

           0       0.94      0.21      0.34    269188
           1       0.23      0.95      0.37     67140

    accuracy                           0.35    336328
   macro avg       0.59      0.58      0.35    336328
weighted avg       0.80      0.35      0.34    336328



### Threshold Selection Rationale

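Errors in credit risk carry asymmetric costs: a false negative (an approved borrower who defaults) puts loan principal at risk, while a false positive (a rejected borrower who would have repaid) costs only the forgone business. The operating threshold was therefore chosen to prioritize recall on the default class, accepting more false positives in exchange for fewer false negatives. This matches a conservative lending strategy focused on capital preservation.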
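
To make the asymmetry concrete, the sketch below compares total error cost across the four evaluated thresholds, using the false-negative and false-positive counts from the results table. The unit costs `COST_FN` and `COST_FP` and the helper names are hypothetical placeholders, not figures or functions from this analysis; a real cost ratio would have to come from the lender's own loss data.

```python
# Error counts per threshold, copied from the results table in this notebook.
ERROR_COUNTS = {
    0.2: {"false_negatives": 466, "false_positives": 255197},
    0.3: {"false_negatives": 3489, "false_positives": 213814},
    0.4: {"false_negatives": 11656, "false_positives": 152669},
    0.5: {"false_negatives": 25543, "false_positives": 89343},
}

# ASSUMPTION: illustrative unit costs only. A missed default is taken to be
# ten times as costly as rejecting an applicant who would have repaid.
COST_FN = 10.0
COST_FP = 1.0


def total_cost(counts, cost_fn=COST_FN, cost_fp=COST_FP):
    """Total cost of errors for one threshold under the assumed unit costs."""
    return counts["false_negatives"] * cost_fn + counts["false_positives"] * cost_fp


def cheapest_threshold(error_counts, cost_fn=COST_FN, cost_fp=COST_FP):
    """Return the threshold whose error counts give the lowest total cost."""
    return min(
        error_counts,
        key=lambda t: total_cost(error_counts[t], cost_fn, cost_fp),
    )


for t, counts in ERROR_COUNTS.items():
    print(f"threshold={t:.1f}  cost={total_cost(counts):,.0f}")

print("cheapest:", cheapest_threshold(ERROR_COUNTS))
```

Under the assumed 10:1 ratio the 0.3 threshold has the lowest cost among the four candidates. With these counts that holds for cost ratios between roughly 7.5 and 13.7; a lower ratio favors a higher threshold and a higher ratio favors 0.2, so the choice is sensitive to the assumption.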

### Programmatic Threshold Selection

Rather than picking a threshold by eye, a business constraint was imposed: recall on default cases must be at least 94% (`MIN_RECALL = 0.94`). Of the four candidate thresholds evaluated (0.2, 0.3, 0.4, 0.5), two satisfy the constraint (0.2 and 0.3), and the one with the higher precision, 0.3, was selected. This follows a common pattern in which risk appetite is fixed first and performance is then optimized within that limit.
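
The same rule can be applied to the full precision-recall curve instead of a four-point grid. The sketch below is one possible way to do that, assuming arrays shaped like the output of `precision_recall_curve`, where `precision` and `recall` each carry one trailing element with no matching threshold. The function name `select_threshold` is hypothetical, and the arrays in the example are toy values for illustration, not results from this model.

```python
import numpy as np


def select_threshold(precision, recall, thresholds, min_recall):
    """Pick the threshold with the highest precision among those meeting min_recall.

    Expects arrays laid out like sklearn's precision_recall_curve output:
    precision and recall have len(thresholds) + 1 entries, and the final
    entry of each has no corresponding threshold, so it is dropped here.
    """
    precision = np.asarray(precision, dtype=float)[:-1]
    recall = np.asarray(recall, dtype=float)[:-1]
    thresholds = np.asarray(thresholds, dtype=float)

    # Indices of thresholds that satisfy the recall constraint
    eligible = np.flatnonzero(recall >= min_recall)
    if eligible.size == 0:
        raise ValueError(f"no threshold reaches recall >= {min_recall}")

    # Among eligible thresholds, take the one with the highest precision
    best = eligible[np.argmax(precision[eligible])]
    return float(thresholds[best]), float(precision[best]), float(recall[best])


# Toy arrays for illustration only (not model output)
toy_thresholds = [0.1, 0.2, 0.3, 0.4]
toy_precision = [0.5, 0.6, 0.7, 0.8, 1.0]
toy_recall = [1.0, 0.9, 0.8, 0.6, 0.0]

print(select_threshold(toy_precision, toy_recall, toy_thresholds, min_recall=0.8))
```

In this notebook the call would be `select_threshold(precision, recall, thresh, MIN_RECALL)`. Because it considers every distinct predicted score, it may return a threshold somewhat different from 0.3.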

### Final Model Performance at Selected Threshold

At the selected threshold of 0.3, the model reaches 94.8% recall on default cases, missing 3,489 of 67,140 defaulters compared with 25,543 at the 0.5 threshold. The cost is a precision of 22.9% and 213,814 false positives: roughly 79% of non-defaulting applicants are flagged, and overall accuracy falls from 0.66 to 0.35. This trade-off is intentional and rests on the assumption that the cost of a default outweighs the cost of rejecting a creditworthy applicant.
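
The headline rates can be recomputed directly from the confusion matrix at the 0.3 threshold. The sketch below does that arithmetic; the counts are taken from the notebook output, while the function name `rates` and the `flagged_share` metric are added here for illustration.

```python
# Confusion matrix at threshold 0.3, from the notebook output:
# rows are true classes, columns are predicted classes.
TN, FP = 55374, 213814
FN, TP = 3489, 63651


def rates(tn, fp, fn, tp):
    """Derive summary rates from the four confusion-matrix counts."""
    total = tn + fp + fn + tp
    return {
        "recall_default": tp / (tp + fn),      # share of defaulters caught
        "precision_default": tp / (tp + fp),   # share of flagged cases that default
        "accuracy": (tp + tn) / total,
        "false_positive_rate": fp / (fp + tn), # share of non-defaulters flagged
        "flagged_share": (tp + fp) / total,    # share of all applicants flagged
    }


metrics = rates(TN, FP, FN, TP)
for name, value in metrics.items():
    print(f"{name:>20}: {value:.3f}")
```

The `false_positive_rate` and `flagged_share` values show the approval-volume side of the trade-off, which the recall and precision figures alone do not make visible.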

