# <span style="color:TEAL">**ASSIGNMENT #2**</span>
Hester van Schalkwyk
## Loan Default Prediction Assignment
 
This notebook follows a structured approach to predicting loan defaults using machine learning.
The assignment consists of building baseline and improved models, optimizing them based on business constraints, and implementing a regression model for loan amount prediction.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold
from sklearn.preprocessing import StandardScaler, OneHotEncoder, LabelEncoder
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score, mean_squared_error
from imblearn.over_sampling import SMOTE
from sklearn.preprocessing import LabelBinarizer

# Suppress Future Warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

### **Step 1: Load and Explore Data**

Objective:
- Load the Lending Club dataset (using a 200-row sample for efficiency). I loaded 200 as the models didn't converge with 100 rows.
- Identify categorical and numerical features.
- Ensure we have the target variable (loan_status) for classification.

Assignment Relevance:
- This step ensures a clean dataset for building the models.

In [2]:
# Load Data (first 200 rows instead of 100 for the model to converge.)
df = pd.read_csv("../data/2-intermediate/df_out_dsif5.csv", nrows=200)

# Display initial dataset info
display(df.head())

# Identify target variable and features
target = 'loan_status'  # Assuming loan_status is the target variable
num_features = df.select_dtypes(include=['float', 'int']).columns.tolist()
cat_features = df.select_dtypes(include=['object']).columns.tolist()

# Remove target from feature lists
num_features = [col for col in num_features if col != target]
cat_features = [col for col in cat_features if col != target]

Unnamed: 0,id,loan_amnt,funded_amnt,funded_amnt_inv,term,int_rate,installment,grade,sub_grade,emp_title,...,int_rate_clean,term_numeric,debt_to_income,loan_amnt_log,grade_encoded,loan_amnt_std,annual_inc_std,loan_amnt_norm,annual_inc_norm,loan_default
0,167338079,4000.0,4000.0,4000.0,36 months,13.08%,134.93,B,B5,cashier,...,0.1308,36.0,0.083333,8.2943,1,-1.196895,-0.367206,0.083969,0.004364,False
1,71016917,24000.0,24000.0,24000.0,60 months,9.16%,500.07,B,B2,ABM,...,0.0916,60.0,0.421053,10.085851,1,0.915452,-0.264024,0.592875,0.005182,False
2,39589826,5000.0,5000.0,5000.0,36 months,10.49%,162.49,B,B3,driver,...,0.1049,36.0,0.090909,8.517393,1,-1.091278,-0.286953,0.109415,0.005,False
3,134798709,24000.0,24000.0,24000.0,60 months,11.05%,522.42,B,B4,,...,0.1105,60.0,0.551724,10.085851,1,0.915452,-0.418798,0.592875,0.003955,False
4,127097355,14000.0,14000.0,14000.0,60 months,13.59%,322.79,C,C2,Shipping Clerk,...,0.1359,60.0,0.291667,9.546884,2,-0.140722,-0.367206,0.338422,0.004364,False


### **Step 2: Preprocess Data (Feature Engineering and Encoding)**

Objective:
- Handle categorical features by limiting categories to the top 10 most frequent and encoding them via one-hot encoding.
- Standardize numerical features to ensure they are on the same scale.

Assignment Relevance:
- This step implements feature engineering, which is required for building the improved model (`model_2`).
- Handling categorical variables properly improves model interpretability.


In [3]:
# One-hot encode categorical variables (limit to top 10 most frequent values)
for col in cat_features:
    top_10 = df[col].value_counts().index[:10]
    df[col] = df[col].apply(lambda x: x if x in top_10 else 'OTHER')
df_encoded = pd.get_dummies(df, columns=cat_features, drop_first=True)

# Convert `loan_status` into binary target `loan_default`
df_encoded['loan_default'] = df_encoded['loan_status'].apply(lambda x: 1 if x == "Charged Off" else 0)

**Feature Engineering (New Features for Model Improvement)**

In [4]:
df_encoded['income_to_debt_ratio'] = df_encoded['annual_inc'] / (df_encoded['dti'] + 1)
df_encoded['loan_to_income_ratio'] = df_encoded['loan_amnt'] / (df_encoded['annual_inc'] + 1)

**Preprocessing Pipelines (Handling Missing Values and Scaling)**

In [5]:
num_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

preprocessor = ColumnTransformer([
    ('num', num_pipeline, num_features)
])

### **Step 3: Split Data and Handle Missing Values**
Objective:
- Prepare X (features) and y (target variable).
- Ensure no class has fewer than two instances, as this would cause issues in stratified splitting.
- Apply SMOTE (Synthetic Minority Over-sampling Technique) to balance the dataset and deal with class imbalance.
- Handle missing values by replacing them with the median value.

Assignment Relevance:
- This step ensures our dataset is balanced before training the models.
- The handling of class imbalance is explicitly required in the assignment.


In [6]:
# Prepare training data
X = df_encoded.drop(columns=[target, 'loan_status'])
y = df_encoded['loan_default']

# Ensure no class has less than 2 instances
unique, counts = np.unique(y, return_counts=True)
class_counts = dict(zip(unique, counts))
rare_classes = [cls for cls, count in class_counts.items() if count < 2]
if rare_classes:
    X = X[~np.isin(y, rare_classes)]
    y = y[~np.isin(y, rare_classes)]

# Perform train-test split without stratification if necessary
try:
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
except ValueError:
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Handle missing values before training
imputer = SimpleImputer(strategy='median')
X_train = pd.DataFrame(imputer.fit_transform(X_train), columns=X_train.columns)
X_test = pd.DataFrame(imputer.transform(X_test), columns=X_test.columns)

# **Handle Class Imbalance with SMOTE**
smote = SMOTE(sampling_strategy='auto', random_state=42)
X_train, y_train = smote.fit_resample(X_train, y_train)

### **Step 4: Train and Evaluate Baseline Model with Cross-Validation**
Objective:
- Train a Logistic Regression model as a baseline.
- Apply cross-validation to ensure robustness.
- Use ROC-AUC as the primary evaluation metric.

Assignment Relevance:
- Cross-validation is explicitly required.
- This forms our baseline model for comparison with `model_2`.

In [7]:
baseline_model = LogisticRegression(max_iter=2000, solver='liblinear', penalty='l2', class_weight='balanced')
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cross_val_scores = cross_val_score(baseline_model, X_train, y_train, cv=skf, scoring='roc_auc')
print(f"Baseline Model Cross-Validation ROC-AUC Scores: {cross_val_scores}")
print(f"Mean ROC-AUC Score: {np.mean(cross_val_scores)}")

baseline_model.fit(X_train, y_train)
y_pred_baseline = baseline_model.predict(X_test)
y_pred_baseline_proba = baseline_model.predict_proba(X_test)[:, 1]

# Evaluate Baseline Model
print("Baseline Model Performance:")
print(classification_report(y_test, y_pred_baseline, zero_division=1))
print("ROC-AUC Score:", roc_auc_score(y_test, y_pred_baseline_proba))

Baseline Model Cross-Validation ROC-AUC Scores: [1.         1.         0.99507389 0.97167488 0.99876847]
Mean ROC-AUC Score: 0.993103448275862
Baseline Model Performance:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        36
           1       1.00      1.00      1.00         4

    accuracy                           1.00        40
   macro avg       1.00      1.00      1.00        40
weighted avg       1.00      1.00      1.00        40

ROC-AUC Score: 1.0


### **Step 5: Train and Evaluate Improved Model with Cross-Validation**
Objective:
- Train a Random Forest Classifier as model_2.
- Apply cross-validation for consistency.
- Evaluate performance using classification metrics and ROC-AUC.

Assignment Relevance:
- `model_2` includes feature selection, cross-validation, and a more powerful classifier than the baseline.
- This meets the mandatory part of the assignment.

In [8]:
model_2 = RandomForestClassifier(n_estimators=50, max_depth=10, random_state=42)
cross_val_scores_2 = cross_val_score(model_2, X_train, y_train, cv=skf, scoring='roc_auc')
print(f"Model_2 Cross-Validation ROC-AUC Scores: {cross_val_scores_2}")
print(f"Mean ROC-AUC Score: {np.mean(cross_val_scores_2)}")

model_2.fit(X_train, y_train)
y_pred_model_2_proba = model_2.predict_proba(X_test)[:, 1]

# **Custom Cost-Based Loss Function**
def custom_loss(y_true, y_pred):
    FP_cost = 100
    FN_cost = 1000
    FP = ((y_true == 0) & (y_pred == 1)).sum()
    FN = ((y_true == 1) & (y_pred == 0)).sum()
    return FP * FP_cost + FN * FN_cost

threshold = 0.3
y_pred_model_2_threshold = (y_pred_model_2_proba > threshold).astype(int)
print("Custom Cost-Based Loss:", custom_loss(y_test, y_pred_model_2_threshold))

Model_2 Cross-Validation ROC-AUC Scores: [1. 1. 1. 1. 1.]
Mean ROC-AUC Score: 1.0
Custom Cost-Based Loss: 0


### **Step 6: Train Regression Model for Loan Amount Prediction**
Objective:
- Consider business cost implications by weighing False Positives (FP) and False Negatives (FN).
- Design a custom loss function to minimize financial risk to the lender.

Assignment Relevance:
- This step is optional but aligns with Part 2 of the assignment.

In [9]:
reg_features = [col for col in df_encoded.columns if 'emp_length' in col or 'home_ownership' in col] + ['annual_inc', 'income_to_debt_ratio', 'loan_to_income_ratio']
X_reg = df_encoded[reg_features]
y_reg = df_encoded['loan_amnt']

X_reg_train, X_reg_test, y_reg_train, y_reg_test = train_test_split(X_reg, y_reg, test_size=0.2, random_state=42)

reg_model = LinearRegression()
reg_model.fit(X_reg_train, y_reg_train)

y_reg_pred = reg_model.predict(X_reg_test)
mse = mean_squared_error(y_reg_test, y_reg_pred)
rmse = np.sqrt(mse)
mean_loan_amount = np.mean(y_reg)

print("Regression Model Performance:")
print("Mean Squared Error:", mse)
print("Root Mean Squared Error (RMSE):", rmse)
print("Mean Loan Amount:", mean_loan_amount)
print("RMSE as Percentage of Mean Loan Amount:", (rmse / mean_loan_amount) * 100, "%")

Regression Model Performance:
Mean Squared Error: 17890972.028226633
Root Mean Squared Error (RMSE): 4229.772101216167
Mean Loan Amount: 14731.0
RMSE as Percentage of Mean Loan Amount: 28.713407787768432 %


29.14% error relative to mean loan amount is quite high. This suggests the model's predictions are not particularly reliable. Try a non-linear regression model.

In [10]:
from sklearn.ensemble import RandomForestRegressor
reg_model_rf = RandomForestRegressor(n_estimators=100, random_state=42)
reg_model_rf.fit(X_reg_train, y_reg_train)
y_reg_pred_rf = reg_model_rf.predict(X_reg_test)
print("Random Forest Regression MSE:", mean_squared_error(y_reg_test, y_reg_pred_rf))

Random Forest Regression MSE: 4265221.5875


## Things to try to improve it:

Random Forest has a higher MSE thean the the Regression Model!

### Feature Engineering Enhancements:
Objective: Improve model performance by adding or transforming features.

#### Explore New Features
Check which additional features might be useful in predicting loan default.

- Explore features such as:
    - debt_to_income (DTI) ratio
    - verification_status
    - loan_purpose
- Add new features based on interactions or transformations.