---
title: Building a Credit Scoring Model using Supervised ML (INSTRUCTOR VERSION)
week: 3
author: Praveen Kumar
date: 2025-10-07
version: v1.0
instructor_only: true
---
# Week 3: Building a Credit Scoring Model using Supervised ML
**INSTRUCTOR VERSION — All solutions and teaching notes included.**

In [None]:
# Parameters
SEED = 42
SAMPLE_MODE = True
DATA_PATH = 'data/synthetic/credit_scoring.csv'

In [None]:
# Setup: Imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.metrics import confusion_matrix, classification_report, roc_auc_score, roc_curve, precision_recall_curve
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
import warnings
warnings.filterwarnings('ignore')

## Load Dataset
Load the synthetic credit scoring dataset, or a Kaggle credit scoring dataset if one is available.

In [None]:
# INSTRUCTOR ONLY: Load data
if SAMPLE_MODE:
    df = pd.read_csv(DATA_PATH)
else:
    # INSTRUCTOR: Point DATA_PATH at the Kaggle dataset if available; until then this branch reads the same file
    df = pd.read_csv(DATA_PATH)

## EDA: Class Balance & Missing Values
Check target distribution and missing values.

In [None]:
# INSTRUCTOR ONLY: EDA
print('Class balance:')
print(df['target'].value_counts(normalize=True))
print('Missing values:')
print(df.isnull().sum())
print('Summary stats:')
print(df.describe())

## Preprocessing
Encode categorical features and standardize numerical features. Class imbalance is handled at training time through `class_weight='balanced'` rather than by resampling.

In [None]:
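# INSTRUCTOR ONLY: Encode categorical features
# Record the numeric feature columns first so label-encoded categoricals are not standardized later
num_cols = df.select_dtypes(include=['int64', 'float64']).columns.drop('target', errors='ignore')
cat_cols = df.select_dtypes(include=['object']).columns.tolist()
for col in cat_cols:
    df[col] = LabelEncoder().fit_transform(df[col].astype(str))

In [None]:
# INSTRUCTOR ONLY: Split data, then standardize numerical features
X = df.drop('target', axis=1)
y = df['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=SEED, stratify=y)
# Fit the scaler on the training split only so test-set statistics do not leak into training
X_train, X_test = X_train.copy(), X_test.copy()
scaler = StandardScaler()
X_train[num_cols] = scaler.fit_transform(X_train[num_cols])
X_test[num_cols] = scaler.transform(X_test[num_cols])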
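
One way to keep preprocessing statistics confined to the training split is to wrap the preprocessing and the model in a single scikit-learn `Pipeline`. The sketch below is illustrative only: the toy columns (`income`, `age`, `employment`) are hypothetical and not taken from the course dataset, and it uses `OneHotEncoder` instead of the notebook's `LabelEncoder`, because label encoding imposes an arbitrary order on nominal categories.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data; column names are illustrative assumptions.
rng = np.random.default_rng(42)
n = 200
toy = pd.DataFrame({
    "income": rng.normal(50_000, 15_000, n),
    "age": rng.integers(21, 70, n),
    "employment": rng.choice(["salaried", "self_employed", "unemployed"], n),
})
toy["target"] = (toy["income"] + rng.normal(0, 10_000, n) < 45_000).astype(int)

X_toy = toy.drop(columns="target")
y_toy = toy["target"]
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# Scale numeric columns and one-hot encode the categorical column.
preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["income", "age"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["employment"]),
])
pipe = Pipeline([
    ("prep", preprocess),
    ("clf", LogisticRegression(class_weight="balanced", max_iter=1000, random_state=42)),
])

# fit() learns scaling and encoding from the training split only;
# predict_proba() reuses those fitted statistics on the test split.
pipe.fit(X_tr, y_tr)
proba_te = pipe.predict_proba(X_te)[:, 1]
print(proba_te[:5])
```

Because the scaler lives inside the pipeline, the same object can be passed to `cross_val_score` and the scaler is refit within each fold.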

## Train Models
Train a Logistic Regression and a Random Forest classifier, both with balanced class weights.

In [None]:
# INSTRUCTOR ONLY: Logistic Regression
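# max_iter is raised because warnings are suppressed above, so a non-converged fit would go unnoticed
lr = LogisticRegression(class_weight='balanced', max_iter=1000, random_state=SEED)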
lr.fit(X_train, y_train)
y_pred_lr = lr.predict(X_test)
y_pred_lr_proba = lr.predict_proba(X_test)[:, 1]

In [None]:
# INSTRUCTOR ONLY: Random Forest
rf = RandomForestClassifier(class_weight='balanced', random_state=SEED)
rf.fit(X_train, y_train)
y_pred_rf = rf.predict(X_test)
y_pred_rf_proba = rf.predict_proba(X_test)[:, 1]
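
Both models above pass `class_weight='balanced'`. scikit-learn documents this setting as weighting each class by `n_samples / (n_classes * np.bincount(y))`, so the rarer class gets a proportionally larger weight. A minimal sketch on a made-up 8:2 label vector (the labels are an assumption for illustration) checks the formula against `compute_class_weight`:

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical labels: 8 negatives and 2 positives.
y_demo = np.array([0] * 8 + [1] * 2)
classes_demo = np.unique(y_demo)

# Manual formula: n_samples / (n_classes * count_per_class)
manual_weights = len(y_demo) / (len(classes_demo) * np.bincount(y_demo))

# The same weights computed by scikit-learn's helper.
sk_weights = compute_class_weight(class_weight="balanced", classes=classes_demo, y=y_demo)

print(dict(zip(classes_demo.tolist(), sk_weights.tolist())))
```

Here the majority class is weighted 0.625 and the minority class 2.5, so each class contributes the same total weight to the loss.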

## Evaluate Models
Compare the two models using confusion matrices, ROC curves, and classification reports.

In [None]:
# INSTRUCTOR ONLY: Confusion Matrix
cm_lr = confusion_matrix(y_test, y_pred_lr)
cm_rf = confusion_matrix(y_test, y_pred_rf)
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
sns.heatmap(cm_lr, annot=True, fmt='d', ax=axes[0], cmap='Blues')
axes[0].set_title('Logistic Regression Confusion Matrix')
sns.heatmap(cm_rf, annot=True, fmt='d', ax=axes[1], cmap='Greens')
axes[1].set_title('Random Forest Confusion Matrix')
plt.show()
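
To read error rates off a confusion matrix, the four cells can be unpacked with `.ravel()`. For binary labels scikit-learn returns them in the order TN, FP, FN, TP. The sketch below uses made-up labels and assumes class 1 is the positive (for example, default) class.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels and predictions for illustration only.
y_true_demo = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred_demo = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]

# For binary labels, ravel() yields the cells in the order TN, FP, FN, TP.
tn, fp, fn, tp = (int(v) for v in confusion_matrix(y_true_demo, y_pred_demo).ravel())

recall_demo = tp / (tp + fn)          # share of actual positives that were caught
precision_demo = tp / (tp + fp)       # share of predicted positives that were correct
false_positive_rate = fp / (fp + tn)  # share of actual negatives flagged as positive

print(tn, fp, fn, tp)
print(recall_demo, precision_demo, round(false_positive_rate, 3))
```

In the notebook, the same unpacking can be applied to `cm_lr` and `cm_rf`.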

In [None]:
# INSTRUCTOR ONLY: ROC Curve
fpr_lr, tpr_lr, _ = roc_curve(y_test, y_pred_lr_proba)
fpr_rf, tpr_rf, _ = roc_curve(y_test, y_pred_rf_proba)
plt.figure(figsize=(8, 6))
plt.plot(fpr_lr, tpr_lr, label='Logistic Regression')
plt.plot(fpr_rf, tpr_rf, label='Random Forest')
plt.plot([0, 1], [0, 1], 'k--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.legend()
plt.show()
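
The ROC curves can be summarized as a single number with `roc_auc_score`, which the notebook imports but does not call. The sketch below assumes a synthetic imbalanced dataset from `make_classification` as a stand-in for the course data. In the notebook itself, the equivalent calls would be `roc_auc_score(y_test, y_pred_lr_proba)` and `roc_auc_score(y_test, y_pred_rf_proba)`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with roughly 10% positives (an assumption for illustration).
X_auc, y_auc = make_classification(
    n_samples=1000, n_features=10, weights=[0.9, 0.1], random_state=42
)
X_auc_tr, X_auc_te, y_auc_tr, y_auc_te = train_test_split(
    X_auc, y_auc, test_size=0.2, random_state=42, stratify=y_auc
)

models_auc = {
    "Logistic Regression": LogisticRegression(class_weight="balanced", max_iter=1000, random_state=42),
    "Random Forest": RandomForestClassifier(class_weight="balanced", random_state=42),
}

auc_scores = {}
for name, model in models_auc.items():
    model.fit(X_auc_tr, y_auc_tr)
    # ROC-AUC is computed from predicted probabilities, not from hard class labels.
    auc_scores[name] = roc_auc_score(y_auc_te, model.predict_proba(X_auc_te)[:, 1])

for name, score in auc_scores.items():
    print(f"{name}: ROC-AUC = {score:.3f}")
```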

In [None]:
# INSTRUCTOR ONLY: Classification Report
print('Logistic Regression:')
print(classification_report(y_test, y_pred_lr))
print('Random Forest:')
print(classification_report(y_test, y_pred_rf))

## Feature Importance & Interpretation
Visualize feature importance and interpret coefficients.

In [None]:
# INSTRUCTOR ONLY: Logistic Regression Coefficients
coef_df = pd.DataFrame({'feature': X.columns, 'coef': lr.coef_[0]})
coef_df = coef_df.sort_values('coef', key=abs, ascending=False)
print(coef_df.head(5))
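
A logistic regression coefficient is a change in log-odds, so exponentiating it gives an odds ratio. When features are standardized, that ratio is the multiplicative change in the odds of the positive class for a one standard deviation increase in the feature. The coefficients and feature names below are hypothetical values chosen for illustration, not results from the course dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical coefficients; feature names and values are illustrative assumptions.
coef_demo = pd.DataFrame({
    "feature": ["debt_ratio", "income", "age"],
    "coef": [0.693, -0.405, 0.0],
})

# exp(coef) converts a log-odds change into an odds ratio.
coef_demo["odds_ratio"] = np.exp(coef_demo["coef"])
print(coef_demo.round(3))
```

In this toy table a coefficient of 0.693 roughly doubles the odds, -0.405 cuts them to about two thirds, and 0 leaves them unchanged. In the notebook, the same column could be added with `coef_df['odds_ratio'] = np.exp(coef_df['coef'])`.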

In [None]:
# INSTRUCTOR ONLY: Random Forest Feature Importance
fi_df = pd.DataFrame({'feature': X.columns, 'importance': rf.feature_importances_})
fi_df = fi_df.sort_values('importance', ascending=False)
print(fi_df.head(5))

In [None]:
# INSTRUCTOR ONLY: Save predictions
# Note: /kaggle/working/ exists only on Kaggle; change the output path when running elsewhere
preds = pd.DataFrame({'y_true': y_test, 'lr_pred': y_pred_lr, 'rf_pred': y_pred_rf})
preds.to_csv('/kaggle/working/predictions_week03.csv', index=False)

## Exercises & Solutions
1. SMOTE oversampling and ROC-AUC comparison:
   - INSTRUCTOR ONLY: Use `imblearn.over_sampling.SMOTE` to balance classes on the training split only, retrain both models, and compare ROC-AUC on the untouched test split.
2. Precision-Recall Curve for both models:
   - INSTRUCTOR ONLY: Plot precision-recall curves for Logistic Regression and Random Forest.
3. SHAP feature importance for top 5 features:
   - INSTRUCTOR ONLY: Use SHAP to compute and plot feature importance for both models.
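
For exercise 2, a minimal sketch of the precision-recall computation is shown here for one model on synthetic stand-in data from `make_classification`, which is an assumption and not the course dataset. The second model follows the same pattern.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, precision_recall_curve
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with roughly 10% positives (an assumption for illustration).
X_pr, y_pr = make_classification(
    n_samples=1000, n_features=10, weights=[0.9, 0.1], random_state=42
)
X_pr_tr, X_pr_te, y_pr_tr, y_pr_te = train_test_split(
    X_pr, y_pr, test_size=0.2, random_state=42, stratify=y_pr
)

model_pr = LogisticRegression(class_weight="balanced", max_iter=1000, random_state=42)
model_pr.fit(X_pr_tr, y_pr_tr)
scores_pr = model_pr.predict_proba(X_pr_te)[:, 1]

# precision and recall have one more entry than thresholds:
# the final point (precision=1, recall=0) has no associated threshold.
precision_pr, recall_pr, thresholds_pr = precision_recall_curve(y_pr_te, scores_pr)
ap_pr = average_precision_score(y_pr_te, scores_pr)

# A no-skill classifier has precision equal to the positive rate.
baseline_pr = y_pr_te.mean()
print(f"Average precision: {ap_pr:.3f} (no-skill baseline: {baseline_pr:.3f})")
```

The arrays can be plotted with `plt.plot(recall_pr, precision_pr)`. On imbalanced data, comparing average precision against the positive-rate baseline is often more informative than ROC-AUC alone.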

## Teaching Notes
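- **Logistic Regression** is often preferred in regulated settings because its coefficients are directly interpretable.
- **Random Forest** captures non-linearities and feature interactions, and is less prone to overfitting than a single decision tree, though it can still overfit small or noisy datasets.
- **Common Pitfalls**: Ignoring class imbalance, data leakage (for example, fitting a scaler on the full dataset before splitting), and improper encoding (for example, label-encoding nominal features for a linear model).
- **Interpretation**: Use SHAP for feature impact and the confusion matrix for error analysis.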