# Credit Risk Classification

Credit risk is an inherently imbalanced classification problem because healthy loans far outnumber risky loans. In this notebook, we train and evaluate logistic regression models on imbalanced classes, first with the original data and then with randomly oversampled training data. We use a dataset of historical lending activity from a peer-to-peer lending services company to build a model that can identify the creditworthiness of borrowers.

## Steps:

* Split of the Data into Training and Testing Sets

* Creation of a Logistic Regression Model with the Original Data

* Prediction of a Logistic Regression Model with Resampled Training Data

* Comparison between the Two Models

* Summary Report: Results are included in the README file, which in this case acts as the Final Report (Overview, Results, and Summary)

In [1]:
# Import the modules
import numpy as np
import pandas as pd
from pathlib import Path
from sklearn.metrics import balanced_accuracy_score
from sklearn.metrics import confusion_matrix
from imblearn.metrics import classification_report_imbalanced
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

import warnings
warnings.filterwarnings('ignore')

---

## Split the Data into Training and Testing Sets

### Step 1: Read the `lending_data.csv` data from the `Resources` folder into a Pandas DataFrame.

In [2]:
# Read the CSV file from the Resources folder into a Pandas DataFrame
df = pd.read_csv(Path('Resources/lending_data.csv'))
df

Unnamed: 0,loan_size,interest_rate,borrower_income,debt_to_income,num_of_accounts,derogatory_marks,total_debt,loan_status
0,10700.0,7.672,52800,0.431818,5,1,22800,0
1,8400.0,6.692,43600,0.311927,3,0,13600,0
2,9000.0,6.963,46100,0.349241,3,0,16100,0
3,10700.0,7.664,52700,0.430740,5,1,22700,0
4,10800.0,7.698,53000,0.433962,5,1,23000,0
...,...,...,...,...,...,...,...,...
77531,19100.0,11.261,86600,0.653580,12,2,56600,1
77532,17700.0,10.662,80900,0.629172,11,2,50900,1
77533,17600.0,10.595,80300,0.626401,11,2,50300,1
77534,16300.0,10.068,75300,0.601594,10,2,45300,1


### Step 2: Create the labels and feature sets (y, X)

In [4]:
label = 'loan_status'
y = df[label]
X = df.drop(columns=label)

In [5]:
# Review the target variable Series
y.head()

0    0
1    0
2    0
3    0
4    0
Name: loan_status, dtype: int64

In [6]:
# Review the X variable DataFrame
X.head()

Unnamed: 0,loan_size,interest_rate,borrower_income,debt_to_income,num_of_accounts,derogatory_marks,total_debt
0,10700.0,7.672,52800,0.431818,5,1,22800
1,8400.0,6.692,43600,0.311927,3,0,13600
2,9000.0,6.963,46100,0.349241,3,0,16100
3,10700.0,7.664,52700,0.43074,5,1,22700
4,10800.0,7.698,53000,0.433962,5,1,23000


### Step 3: Check the balance of the labels

In [28]:
print("Class distribution in the target variable.")
y.value_counts()

Class distribution in the target variable.


0    75036
1     2500
Name: loan_status, dtype: int64

In [8]:
y.shape

(77536,)
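To put the imbalance in numbers, the short calculation below uses only the counts printed above. It also suggests why plain accuracy is a poor yardstick here: a hypothetical classifier that labels every loan as healthy (not a model from this notebook) would already score close to 97%, while its balanced accuracy would be only 0.5.

```python
# Class counts copied from the value_counts() output above
healthy, high_risk = 75036, 2500
total = healthy + high_risk  # 77536, matching y.shape

# Share of the minority (high-risk) class
minority_share = high_risk / total

# Accuracy of a hypothetical "always predict 0" baseline
baseline_accuracy = healthy / total

# Balanced accuracy of that baseline: mean of per-class recall,
# which is 1.0 for class 0 and 0.0 for class 1
baseline_balanced_accuracy = (1.0 + 0.0) / 2

print(f"minority share:               {minority_share:.2%}")
print(f"always-healthy accuracy:      {baseline_accuracy:.2%}")
print(f"always-healthy balanced acc.: {baseline_balanced_accuracy:.2f}")
```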

### Step 4: Split the data into training and testing datasets

In [9]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1, test_size=0.25)
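The split above is not stratified, so the share of high-risk loans in the test set depends on the random draw. With classes this imbalanced, a stratified split is a common alternative; in scikit-learn this is what the `stratify=y` argument of `train_test_split` does. The sketch below illustrates the idea with the standard library only. The function `stratified_split` and the toy labels are hypothetical and are not part of this notebook.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size=0.25, seed=1):
    """Return (train_idx, test_idx) so each class keeps roughly the same share."""
    rng = random.Random(seed)
    # Group row indices by class label
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)
    train_idx, test_idx = [], []
    # Shuffle and split each class separately, then merge
    for idx in by_class.values():
        rng.shuffle(idx)
        n_test = round(len(idx) * test_size)
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Toy labels (hypothetical): 96 healthy loans and 4 high-risk loans
toy_labels = [0] * 96 + [1] * 4
train_idx, test_idx = stratified_split(toy_labels)
test_positives = sum(toy_labels[i] for i in test_idx)
print(len(train_idx), len(test_idx), test_positives)
```

Because each class is split on its own, the test set always receives one quarter of the high-risk rows, whatever the seed.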

---

## Create a Logistic Regression Model with the Original Data

### Step 1: Fit a logistic regression model to the training data

In [13]:
# Model Instantiation and training

model = LogisticRegression(random_state=1)

model.fit(X_train, y_train)

LogisticRegression(random_state=1)

### Step 2: Predict the labels for the testing data

In [15]:
y_pred_test = model.predict(X_test)

### Step 3: Evaluate the model’s performance

In [16]:
balanced_accuracy_score(y_test, y_pred_test)

0.9520479254722232

In [17]:
print("Confusion Matrix")
confusion_matrix(y_test, y_pred_test)

Confusion Matrix


array([[18663,   102],
       [   56,   563]])
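The balanced accuracy score reported earlier can be recovered from this matrix, since it is the unweighted mean of the per-class recalls. The sketch below redoes that arithmetic by hand, assuming scikit-learn's usual layout with true labels in rows and predicted labels in columns.

```python
# Confusion matrix copied from the output above:
# rows are true labels (0, 1), columns are predicted labels (0, 1)
cm = [[18663, 102],
      [56, 563]]

# Recall per class = correct predictions on the diagonal / row total
recall_healthy = cm[0][0] / sum(cm[0])
recall_high_risk = cm[1][1] / sum(cm[1])

# Balanced accuracy is the unweighted mean of the per-class recalls
balanced_accuracy = (recall_healthy + recall_high_risk) / 2

print(f"recall (healthy):   {recall_healthy:.4f}")
print(f"recall (high-risk): {recall_high_risk:.4f}")
print(f"balanced accuracy:  {balanced_accuracy:.4f}")
```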

In [18]:
# Sanity check: the confusion matrix entries sum to the number of test predictions
confusion_matrix(y_test, y_pred_test).sum() == len(y_pred_test)

True

In [19]:
print(" \n                Classification Report with the Imbalanced Data Model")
print(classification_report_imbalanced(y_test, y_pred_test))

 
                Classification Report with the Imbalanced Data Model
                   pre       rec       spe        f1       geo       iba       sup

          0       1.00      0.99      0.91      1.00      0.95      0.91     18765
          1       0.85      0.91      0.99      0.88      0.95      0.90       619

avg / total       0.99      0.99      0.91      0.99      0.95      0.91     19384
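The less familiar columns of this report are `spe` (specificity), `geo` (geometric mean of recall and specificity), and `iba` (index of balanced accuracy). The sketch below recomputes them for class `1` from the confusion matrix counts. The IBA formula assumes imbalanced-learn's default `alpha=0.1` applied to the squared geometric mean; that default is an assumption here and is not stated in this notebook.

```python
import math

# Counts from the confusion matrix, treating class 1 (high-risk) as positive
tn, fp, fn, tp = 18663, 102, 56, 563

recall = tp / (tp + fn)        # "rec": share of high-risk loans that were caught
specificity = tn / (tn + fp)   # "spe": share of healthy loans that were passed
geo = math.sqrt(recall * specificity)  # "geo": geometric mean of the two

# "iba": squared geometric mean, weighted by how far recall exceeds specificity
alpha = 0.1  # assumed library default
iba = (1 + alpha * (recall - specificity)) * geo ** 2

print(f"rec={recall:.2f} spe={specificity:.2f} geo={geo:.2f} iba={iba:.2f}")
```

Under that assumption the rounded values agree with the class `1` row of the report.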



### Step 4: Logistic Regression Raw Model Assessment

At this point, we can assess how well the logistic regression model predicts both the `0` (healthy loan) and `1` (high-risk loan) labels.

### Healthy Loan Predictions
Healthy loans are predicted with a precision of 100% and a recall of 99% (both rounded), so the model is nearly perfect for this class. High-risk loans, however, have a precision of 85% and a recall of 91%. While this is strong, we aim to improve these metrics by addressing the class imbalance. The F1 score for high-risk loans is 88%, which reflects a reasonable balance between precision and recall but leaves room for improvement.

### Recall of High-Risk Loans
A recall of 91% means that out of the 619 high-risk loans in the test set, 563 (≈ 0.91 × 619) were correctly classified as high-risk by the model, while 56 were misclassified as healthy. These 56 high-risk loans would have been wrongly approved, exposing the lender to the costs associated with defaults.

### Precision of High-Risk Loans
With a precision of 85%, the 563 true positives (see the confusion matrix) account for most of the 665 loans (= 563 + 102) that the model classified as high-risk. The remaining 102 (≈ 0.15 × 665) were healthy loans misclassified as high-risk. This implies an opportunity cost: 102 healthy loans would not have been approved, potentially leading to missed profitable opportunities.
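The loan counts behind these percentages can be read directly from the confusion matrix of the original model. The sketch below is a worked check of that arithmetic, with class `1` (high-risk) treated as the positive class.

```python
# Counts from the confusion matrix of the original model
tn, fp, fn, tp = 18663, 102, 56, 563

actual_high_risk = tp + fn       # high-risk loans in the test set
predicted_high_risk = tp + fp    # loans the model flagged as high-risk

recall = tp / actual_high_risk
precision = tp / predicted_high_risk
f1 = 2 * precision * recall / (precision + recall)

print(f"high-risk loans in test set: {actual_high_risk}")
print(f"flagged as high-risk:        {predicted_high_risk}")
print(f"missed high-risk loans (FN): {fn}")
print(f"rejected healthy loans (FP): {fp}")
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```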



---

## Predict a Logistic Regression Model with Resampled Training Data

### Step 1: Use the `RandomOverSampler` class from the imbalanced-learn library to resample the data

In [20]:
from imblearn.over_sampling import RandomOverSampler

# Instantiation
random_oversampler = RandomOverSampler(random_state=1)

# Resample the original training data so that both classes have equal counts
X_resampled, y_resampled = random_oversampler.fit_resample(X_train, y_train)
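Conceptually, random oversampling draws rows of the minority class with replacement and appends the copies until every class is as large as the biggest one. The sketch below shows that idea with the standard library only. It is not imbalanced-learn's implementation, and the function `random_oversample` and the toy rows are hypothetical.

```python
import random
from collections import Counter

def random_oversample(rows, labels, seed=1):
    """Duplicate randomly chosen minority rows until all classes are equal in size."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())  # size of the largest class
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        pool = [r for r, l in zip(rows, labels) if l == cls]
        # Draw with replacement; copy each row so duplicates are independent lists
        extra = [list(rng.choice(pool)) for _ in range(target - n)]
        out_rows.extend(extra)
        out_labels.extend([cls] * len(extra))
    return out_rows, out_labels

# Toy data (loan_size, interest_rate): three healthy rows and one high-risk row
toy_rows = [[10700.0, 7.672], [8400.0, 6.692], [9000.0, 6.963], [19100.0, 11.261]]
toy_labels = [0, 0, 0, 1]

res_rows, res_labels = random_oversample(toy_rows, toy_labels)
print(Counter(res_labels))
```

No new information is created: the single high-risk row simply appears three times, which is why only the training data is resampled while the test set keeps its original distribution.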


In [27]:
print("Class counts are now balanced")
y_resampled.value_counts()


Class counts are now balanced


0    56271
1    56271
Name: loan_status, dtype: int64
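The counts above also tell us how much duplication took place. The short calculation below combines figures already shown in this notebook: the total number of high-risk loans, the number of high-risk loans in the test set (the support of class `1` in the earlier classification report), and the resampled class size.

```python
total_high_risk = 2500        # from y.value_counts()
test_high_risk = 619          # support of class 1 in the test-set report
resampled_per_class = 56271   # from y_resampled.value_counts()

# High-risk rows that were available for training before resampling
train_high_risk = total_high_risk - test_high_risk

# Duplicated high-risk rows added by the oversampler
duplicates_added = resampled_per_class - train_high_risk

# Average number of times each original high-risk row now appears
avg_copies = resampled_per_class / train_high_risk

print(train_high_risk, duplicates_added, round(avg_copies, 1))
```

Each original high-risk training row therefore appears about 30 times on average in the resampled data.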

### Step 2: Use the `LogisticRegression` classifier and the resampled data to fit the model and make predictions.

In [22]:
# Instantiate the Logistic Regression model
model_oversampled = LogisticRegression(random_state=1)

# Fit the model using the resampled training data
model_oversampled.fit(X_resampled, y_resampled)

# Make a prediction using the testing data
y_pred_resampled = model_oversampled.predict(X_test)

### Step 3: Evaluate the model’s performance

* Calculate the balanced accuracy score of the model.

* Generate a confusion matrix.

* Print the classification report.

In [23]:
# Calculate the balanced accuracy score of the model
balanced_accuracy_score(y_test, y_pred_resampled)

0.9936781215845847

In [25]:
print("Confusion Matrix")
confusion_matrix(y_test, y_pred_resampled)

Confusion Matrix


array([[18649,   116],
       [    4,   615]])

In [26]:
# Print the classification report for the model
print(" \n             Classification Report with the Rebalanced Data Model")
print(classification_report_imbalanced(y_test, y_pred_resampled))

 
             Classification Report with the Rebalanced Data Model
                   pre       rec       spe        f1       geo       iba       sup

          0       1.00      0.99      0.99      1.00      0.99      0.99     18765
          1       0.84      0.99      0.99      0.91      0.99      0.99       619

avg / total       0.99      0.99      0.99      0.99      0.99      0.99     19384



### Step 4: Assessment of the Logistic Regression Model with Resampled Data

Let's review the classification report for the logistic regression model fit with oversampled data. We will assess how well the model predicts both the `0` (healthy loan) and `1` (high-risk loan) labels.

---

### High-Risk Loans

The model predicts high-risk loans considerably better after resampling. In particular, recall improves from 0.91 to 0.99: false negatives (loans incorrectly classified as healthy when they were actually high-risk) drop from 56 to 4. This is crucial for managing the cost associated with loan defaults. Without resampling, the lender would have to absorb the default cost of the 56 high-risk loans misclassified by the previous model. With resampling, that exposure shrinks to 4 loans, although it is not eliminated.

However, the opportunity cost associated with healthy loans incorrectly classified as high-risk remains, and it grows slightly: false positives rise from 102 to 116. Resampling did not improve the precision for high-risk loans, which slips from 0.85 to 0.84.
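One way to weigh this trade-off is to compare the error counts of the two confusion matrices. The notebook contains no cost figures, so the sketch below only computes the break-even ratio, under the simplifying assumption that every missed high-risk loan costs the same amount and every rejected healthy loan costs the same amount.

```python
# Error counts taken from the two confusion matrices in this notebook
original = {"fp": 102, "fn": 56}
resampled = {"fp": 116, "fn": 4}

# Missed high-risk loans avoided by resampling
fn_avoided = original["fn"] - resampled["fn"]

# Additional healthy loans rejected after resampling
fp_added = resampled["fp"] - original["fp"]

# Resampling pays off when
#   cost_per_missed_default * fn_avoided > cost_per_rejected_healthy_loan * fp_added
# which is the same as cost_per_missed_default / cost_per_rejected_healthy_loan > break_even_ratio
break_even_ratio = fp_added / fn_avoided

print(f"false negatives avoided: {fn_avoided}")
print(f"false positives added:   {fp_added}")
print(f"break-even cost ratio:   {break_even_ratio:.2f}")
```

Under these assumptions, the resampled model is preferable whenever a missed default costs more than roughly a quarter of what a wrongly rejected healthy loan costs.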

---

### Healthy Loans

Healthy loans are well classified in both models. Since the initial classification of healthy loans was already highly accurate using the original data, the addition of oversampling did not significantly improve the classification performance.

---

### Conclusions

Using resampled training data substantially improves the balanced accuracy of the logistic regression model, from 0.952 to 0.994. The recall for high-risk loans rises from 0.91 to 0.99, which sharply reduces, but does not eliminate, the risk of misclassifying default-prone loans. The trade-off is a slight drop in precision for high-risk loans, from 0.85 to 0.84.

A full analysis is available in the README file. Thank you for reading.
