In [1]:
# Step 0: Install imbalanced-learn if not available
!pip install imbalanced-learn



# Credit Risk Analysis (Give Me Some Credit Dataset)

## Objective
Predict whether a borrower will **default on a loan within 2 years** using ML.  

**Target:** `SeriousDlqin2yrs`  
- `1` = Default (High risk)  
- `0` = No Default (Low risk)  

---

## Challenges
- Dataset is **imbalanced** (most people repay, few default).  
- Accuracy alone is misleading → use **Balanced Accuracy & Recall**.  
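
To see why accuracy misleads here, the sketch below scores a hypothetical "model" that predicts no default for every borrower. It uses the test-set class counts reported later in this notebook (22,383 non-defaults, 1,671 defaults). The helper names are illustrative and not part of any library.

```python
def accuracy(tn, fp, fn, tp):
    """Fraction of all predictions that are correct."""
    return (tn + tp) / (tn + fp + fn + tp)


def balanced_accuracy(tn, fp, fn, tp):
    """Mean of the per-class recalls (binary case)."""
    recall_neg = tn / (tn + fp)  # specificity
    recall_pos = tp / (tp + fn)  # sensitivity
    return (recall_neg + recall_pos) / 2


# Test-set class counts reported later in this notebook.
n_neg, n_pos = 22383, 1671

# Hypothetical "model" that predicts 0 (no default) for everyone.
tn, fp, fn, tp = n_neg, 0, n_pos, 0

print(f"accuracy          = {accuracy(tn, fp, fn, tp):.4f}")
print(f"balanced accuracy = {balanced_accuracy(tn, fp, fn, tp):.4f}")
```

The always-negative predictor reaches roughly 93% accuracy while its balanced accuracy is 0.5, the same as random guessing.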

---

## Approach
1. Upload & explore dataset.  
2. Preprocess data (drop rows with missing values, scale features).  
3. Train models:
   - Logistic Regression (baseline)  
   - Logistic Regression + SMOTE (oversampling)  
   - Balanced Random Forest  
   - Easy Ensemble Classifier  
4. Compare results.  


In [2]:
# Step 1: Imports
import pandas as pd

# Preprocessing & Models
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from imblearn.over_sampling import SMOTE

# Advanced Models
from imblearn.ensemble import BalancedRandomForestClassifier, EasyEnsembleClassifier

# Metrics
from sklearn.metrics import balanced_accuracy_score, confusion_matrix, classification_report
from imblearn.metrics import classification_report_imbalanced


In [3]:
# STEP 2: Upload Dataset
from google.colab import files
uploaded = files.upload()


Saving cs-training.csv to cs-training.csv


In [4]:
# STEP 3: Load Dataset
df = pd.read_csv("cs-training.csv")
df.head()


| Unnamed: 0.1 | Unnamed: 0 | SeriousDlqin2yrs | RevolvingUtilizationOfUnsecuredLines | age | NumberOfTime30-59DaysPastDueNotWorse | DebtRatio | MonthlyIncome | NumberOfOpenCreditLinesAndLoans | NumberOfTimes90DaysLate | NumberRealEstateLoansOrLines | NumberOfTime60-89DaysPastDueNotWorse | NumberOfDependents |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 0.766127 | 45 | 2 | 0.802982 | 9120.0 | 13 | 0 | 6 | 0 | 2.0 |
| 1 | 2 | 0 | 0.957151 | 40 | 0 | 0.121876 | 2600.0 | 4 | 0 | 0 | 0 | 1.0 |
| 2 | 3 | 0 | 0.65818 | 38 | 1 | 0.085113 | 3042.0 | 2 | 1 | 0 | 0 | 0.0 |
| 3 | 4 | 0 | 0.23381 | 30 | 0 | 0.03605 | 3300.0 | 5 | 0 | 0 | 0 | 0.0 |
| 4 | 5 | 0 | 0.907239 | 49 | 1 | 0.024926 | 63588.0 | 7 | 0 | 1 | 0 | 0.0 |


## Dataset Information
Columns:
- `SeriousDlqin2yrs` → Target (default within 2 years)  
- `RevolvingUtilizationOfUnsecuredLines` → Credit utilization ratio  
- `age` → Borrower’s age  
- `NumberOfTime30-59DaysPastDueNotWorse` → Late payments (30–59 days)  
- `DebtRatio` → Debt / income ratio  
- `MonthlyIncome` → Borrower’s income  
- `NumberOfOpenCreditLinesAndLoans` → Open accounts  
- `NumberOfTimes90DaysLate` → 90+ days late payments  
- `NumberRealEstateLoansOrLines` → Real estate accounts  
- `NumberOfTime60-89DaysPastDueNotWorse` → Late payments (60–89 days)  
- `NumberOfDependents` → Dependents  


In [5]:
# STEP 4: Preprocess Data
# Handle missing values by dropping every row that contains a NaN
df = df.dropna()

# Split features and target
# Note: X keeps the CSV row-id column `Unnamed: 0`, which is why it has 11 columns;
# it is an identifier rather than a borrower attribute.
X = df.drop("SeriousDlqin2yrs", axis=1)
y = df["SeriousDlqin2yrs"]

# Stratified 80/20 split keeps the default rate the same in train and test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=0.2, random_state=42
)

# Fit the scaler on the training split only, then reuse it on the test split
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

X_train.shape, X_test.shape


((96215, 11), (24054, 11))
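
The scaling step fits on the training split and only transforms the test split. The sketch below mirrors that pattern for one column in plain Python, assuming the usual standardization with the population standard deviation. The toy values and helper names are hypothetical.

```python
from statistics import fmean, pstdev


def fit_standardizer(column):
    """Learn the mean and population std of one training column."""
    mean = fmean(column)
    std = pstdev(column) or 1.0  # guard against constant columns
    return mean, std


def standardize(column, mean, std):
    """Apply previously learned statistics to any split."""
    return [(x - mean) / std for x in column]


# Hypothetical toy values standing in for one feature such as `age`.
train_age = [30.0, 40.0, 50.0, 60.0]
test_age = [45.0, 70.0]

mean, std = fit_standardizer(train_age)         # statistics come from train only
train_scaled = standardize(train_age, mean, std)
test_scaled = standardize(test_age, mean, std)  # test reuses the train statistics

print(mean, round(std, 3))
print([round(v, 3) for v in test_scaled])
```

Reusing the training statistics keeps information about the test rows out of preprocessing, which is the reason the notebook calls `fit_transform` on the training split and `transform` on the test split.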

## Step 5: Logistic Regression (Baseline)
We start with Logistic Regression on the imbalanced dataset.  
This gives us a **baseline model** to compare with.


In [6]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score, classification_report

log_reg = LogisticRegression(max_iter=1000)
log_reg.fit(X_train_scaled, y_train)
y_pred_lr = log_reg.predict(X_test_scaled)

print("Balanced Accuracy (Logistic Regression):", balanced_accuracy_score(y_test, y_pred_lr))
print(classification_report(y_test, y_pred_lr))


Balanced Accuracy (Logistic Regression): 0.5192748552196136
              precision    recall  f1-score   support

           0       0.93      1.00      0.96     22383
           1       0.59      0.04      0.08      1671

    accuracy                           0.93     24054
   macro avg       0.76      0.52      0.52     24054
weighted avg       0.91      0.93      0.90     24054



## Step 6: Logistic Regression + SMOTE
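To handle imbalance, we apply **SMOTE (Synthetic Minority Over-sampling Technique)**.  
This balances the training data by generating synthetic defaults, interpolated between existing default rows and their nearest minority-class neighbors. Only the training split is resampled; the test split keeps its original class ratio.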
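
The core idea behind a synthetic sample can be sketched in a few lines. This is a simplified illustration, not the imbalanced-learn implementation: it skips the nearest-neighbor search and interpolates between two hypothetical minority rows chosen by hand.

```python
import random


def smote_like_sample(x, neighbor, rng):
    """Create one synthetic point on the segment between x and neighbor."""
    u = rng.random()  # interpolation weight in [0, 1)
    return [xi + u * (ni - xi) for xi, ni in zip(x, neighbor)]


rng = random.Random(42)

# Two hypothetical minority-class (default) rows, already scaled.
x = [0.2, -1.0]
neighbor = [0.6, -0.4]

synthetic = smote_like_sample(x, neighbor, rng)
print(synthetic)
```

Every synthetic row lies between two real default rows in feature space, so the oversampled class gains variety without exact duplicates.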


In [7]:
from imblearn.over_sampling import SMOTE

smote = SMOTE(random_state=42)
X_res, y_res = smote.fit_resample(X_train_scaled, y_train)

log_reg_sm = LogisticRegression(max_iter=1000)
log_reg_sm.fit(X_res, y_res)
y_pred_sm = log_reg_sm.predict(X_test_scaled)

print("Balanced Accuracy (Logistic + SMOTE):", balanced_accuracy_score(y_test, y_pred_sm))


Balanced Accuracy (Logistic + SMOTE): 0.7107999699374309


## Step 7: Ensemble Models
Instead of resampling the training set up front, we try ensembles that rebalance the data internally:
- **Balanced Random Forest** → each tree is trained on a sample in which the majority class is randomly undersampled.  
- **Easy Ensemble Classifier** → a bag of AdaBoost classifiers, each trained on a balanced subset of the data.
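
Both ensembles rely on giving each estimator a class-balanced view of the training data. The sketch below shows one way to draw such a subset by random undersampling. It is an illustration under simplifying assumptions (no bootstrap, sampling without replacement), and the exact sampling details in imbalanced-learn may differ by version. The function name and toy labels are hypothetical.

```python
import random
from collections import Counter


def balanced_subset(labels, rng):
    """Return row indices with every class undersampled to the minority size."""
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    n_min = min(len(rows) for rows in by_class.values())
    chosen = []
    for rows in by_class.values():
        chosen.extend(rng.sample(rows, n_min))  # sample without replacement
    return sorted(chosen)


# Hypothetical toy labels: 18 non-defaults and 2 defaults.
y_toy = [0] * 18 + [1] * 2
rng = random.Random(0)

# Each estimator in the ensemble would see a different balanced subset.
for estimator_id in range(3):
    idx = balanced_subset(y_toy, rng)
    print(estimator_id, idx, Counter(y_toy[i] for i in idx))
```

Because each estimator sees a different slice of the majority class, the ensemble as a whole still uses most of the non-default rows even though each member trains on a small balanced set.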


In [8]:
from imblearn.ensemble import BalancedRandomForestClassifier, EasyEnsembleClassifier

# Balanced Random Forest
brf = BalancedRandomForestClassifier(n_estimators=100, random_state=42)
brf.fit(X_train, y_train)
y_pred_brf = brf.predict(X_test)

# Easy Ensemble Classifier
eec = EasyEnsembleClassifier(n_estimators=100, random_state=42)
eec.fit(X_train, y_train)
y_pred_eec = eec.predict(X_test)

print("Balanced Accuracy (Balanced RF):", balanced_accuracy_score(y_test, y_pred_brf))
print("Balanced Accuracy (Easy Ensemble):", balanced_accuracy_score(y_test, y_pred_eec))


Balanced Accuracy (Balanced RF): 0.7552775730426986
Balanced Accuracy (Easy Ensemble): 0.758546155548449


## Confusion Matrix & Imbalanced Classification Report

The confusion matrix shows how many **defaults (high-risk)** and **non-defaults (low-risk)** were correctly or incorrectly classified:

- **Rows = Actual values** (what really happened)  
- **Columns = Predicted values** (what the model guessed)  

scikit-learn orders rows and columns by sorted label, so class `0` (low-risk) comes first and class `1` (high-risk) second. With default (`1`) as the positive class:
- Top-left = Correctly predicted low-risk (True Negatives).  
- Top-right = Low-risk predicted as high-risk (False Positives).  
- Bottom-left = High-risk predicted as low-risk (False Negatives).  
- Bottom-right = Correctly predicted high-risk (True Positives).  
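
The cell layout depends on the label order. The pure-Python sketch below builds a 2x2 matrix under the scikit-learn convention that rows and columns follow the sorted labels (`0` before `1`). The function name and toy labels are hypothetical.

```python
def confusion_counts(y_true, y_pred, labels=(0, 1)):
    """Rows are actual labels, columns are predicted labels, both in `labels` order."""
    position = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for actual, predicted in zip(y_true, y_pred):
        matrix[position[actual]][position[predicted]] += 1
    return matrix


# Hypothetical toy labels: 1 = default (high risk), 0 = no default (low risk).
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 1, 0, 1]

(tn, fp), (fn, tp) = confusion_counts(y_true, y_pred)
print(f"TN={tn} FP={fp}")
print(f"FN={fn} TP={tp}")
```

With labels in this order, the first row always describes actual non-defaults and the second row actual defaults, so any row and column names attached to the matrix need to follow the same order.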

---

The **imbalanced classification report** goes beyond accuracy and includes:  

- **Precision (pre):** Out of all predicted positives, how many were correct?  
- **Recall (rec):** Out of all actual positives, how many did the model catch?  
- **Specificity (spe):** Out of all actual negatives, how many were identified correctly?  
- **F1 score:** Harmonic mean of precision and recall.  
- **Geo mean (geo):** Geometric mean of sensitivity (recall) and specificity.  
- **IBA (Index of Balanced Accuracy):** Squared geometric mean, weighted by the gap between recall and specificity.  
- **sup:** Support, i.e., number of samples in each class.  

This helps evaluate **how well the model identifies high-risk borrowers**, which is critical for banking applications.  


In [9]:
from sklearn.metrics import confusion_matrix
from imblearn.metrics import classification_report_imbalanced

# ---- For Balanced Random Forest ----
y_pred_brf = brf.predict(X_test)

# confusion_matrix orders rows/columns by sorted label: 0 (low-risk) first, then 1 (high-risk)
matrix_brf = confusion_matrix(y_test, y_pred_brf)
cm_df_brf = pd.DataFrame(
    matrix_brf,
    index=["Actual Low-Risk", "Actual High-Risk"],
    columns=["Predicted Low-Risk", "Predicted High-Risk"]
)

print("Confusion Matrix - Balanced RF")
print(cm_df_brf, "\n")

print("Classification Report - Balanced RF")
print(classification_report_imbalanced(y_test, y_pred_brf))


Confusion Matrix - Balanced RF
                  Predicted Low-Risk  Predicted High-Risk
Actual Low-Risk                19304                 3079
Actual High-Risk                 588                 1083 

Classification Report - Balanced RF
                   pre       rec       spe        f1       geo       iba       sup

          0       0.97      0.86      0.65      0.91      0.75      0.57     22383
          1       0.26      0.65      0.86      0.37      0.75      0.55      1671

avg / total       0.92      0.85      0.66      0.88      0.75      0.57     24054
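
The class `1` row of this report can be reproduced by hand from the four confusion-matrix counts. The sketch below does the arithmetic in plain Python. The IBA line assumes a weight of `alpha = 0.1` applied to the squared geometric mean, which is consistent with the values printed above but is stated here as an assumption.

```python
from math import sqrt

# Counts read from the Balanced RF confusion matrix above
# (rows = actual class 0 then 1, columns = predicted class 0 then 1).
tn, fp = 19304, 3079
fn, tp = 588, 1083

precision = tp / (tp + fp)
recall = tp / (tp + fn)          # sensitivity
specificity = tn / (tn + fp)
f1 = 2 * precision * recall / (precision + recall)
geo = sqrt(recall * specificity)
balanced_acc = (recall + specificity) / 2

# IBA with weight alpha = 0.1 (assumed default of the report).
alpha = 0.1
iba = (1 + alpha * (recall - specificity)) * geo ** 2

for name, value in [("pre", precision), ("rec", recall), ("spe", specificity),
                    ("f1", f1), ("geo", geo), ("iba", iba),
                    ("balanced accuracy", balanced_acc)]:
    print(f"{name:<18} {value:.2f}")
```

The mean of recall and specificity also recovers the balanced accuracy of about 0.755 printed for this model earlier.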



In [10]:
# ---- For Easy Ensemble ----
y_pred_eec = eec.predict(X_test)

# confusion_matrix orders rows/columns by sorted label: 0 (low-risk) first, then 1 (high-risk)
matrix_eec = confusion_matrix(y_test, y_pred_eec)
cm_df_eec = pd.DataFrame(
    matrix_eec,
    index=["Actual Low-Risk", "Actual High-Risk"],
    columns=["Predicted Low-Risk", "Predicted High-Risk"]
)

print("Confusion Matrix - Easy Ensemble")
print(cm_df_eec, "\n")

print("Classification Report - Easy Ensemble")
print(classification_report_imbalanced(y_test, y_pred_eec))


Confusion Matrix - Easy Ensemble
                  Predicted Low-Risk  Predicted High-Risk
Actual Low-Risk                17642                 4741
Actual High-Risk                 453                 1218 

Classification Report - Easy Ensemble
                   pre       rec       spe        f1       geo       iba       sup

          0       0.97      0.79      0.73      0.87      0.76      0.58     22383
          1       0.20      0.73      0.79      0.32      0.76      0.57      1671

avg / total       0.92      0.78      0.73      0.83      0.76      0.58     24054



# Conclusion

- Both ensemble models outperform the **baseline Logistic Regression** on balanced accuracy (about 0.76 versus 0.52) on this imbalanced credit risk data.  
- **Balanced Random Forest** achieves **higher precision** on defaulters (0.26 versus 0.20) but misses more of them (recall ~65%).  
- **Easy Ensemble Classifier** achieves **higher recall (~73%)**, meaning it catches more high-risk borrowers, which is crucial for banking risk management.  
- Trade-off: higher recall comes at the cost of more false alarms. Easy Ensemble flags 4,741 safe customers as risky, versus 3,079 for Balanced Random Forest.  

### Key Business Insight
In banking, **missing a high-risk borrower (false negative)** is usually far more costly than flagging a safe one.  
Therefore, the **Easy Ensemble Classifier is the preferred model**: it detects about 73% of defaulters on the test set, the higher recall of the two ensembles.  
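
How much more costly a missed defaulter must be for this preference to hold can be estimated from the two confusion matrices. The sketch below compares total misclassification cost under a simple linear cost model. The cost ratios are hypothetical placeholders, not figures from the dataset or from any bank.

```python
# (false negatives, false positives) read from the confusion matrices above.
errors = {
    "Balanced Random Forest": (588, 3079),
    "Easy Ensemble Classifier": (453, 4741),
}


def total_cost(fn, fp, cost_fn, cost_fp=1.0):
    """Total misclassification cost for given per-error costs."""
    return fn * cost_fn + fp * cost_fp


# Cost ratio (missed defaulter : false alarm) at which both models cost the same.
(fn_brf, fp_brf), (fn_eec, fp_eec) = errors.values()
break_even = (fp_eec - fp_brf) / (fn_brf - fn_eec)
print(f"break-even cost ratio (FN : FP) = {break_even:.1f}")

# Hypothetical ratios: a missed defaulter costs 5x or 20x a false alarm.
for ratio in (5, 20):
    costs = {name: total_cost(fn, fp, cost_fn=ratio) for name, (fn, fp) in errors.items()}
    print(ratio, min(costs, key=costs.get), costs)
```

Under this model, Easy Ensemble is the cheaper choice once a missed defaulter costs more than roughly 12 times a false alarm; below that ratio Balanced Random Forest comes out ahead.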

Final Outcome:  
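This project demonstrates how **ensemble methods designed for imbalanced data improve loan default prediction** over a plain logistic regression, raising balanced accuracy from about 0.52 to about 0.76 (SMOTE with logistic regression reaches about 0.71), and provides a framework banks can use to **reduce financial risk**.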
