## Credit Risk Evaluator

### Background

* Lending services companies allow individual investors to partially fund personal loans and to buy and sell the notes backing those loans on a secondary market. This notebook builds machine learning models that classify the risk level of loans using (#1) **Logistic Regression** and (#2) a **Random Forest Classifier**.


In [1]:
# Import relevant modules
import numpy as np
import pandas as pd
from pathlib import Path
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report, confusion_matrix

In [2]:
# Read the CSV file from the Resources folder into a pandas DataFrame
lending_data_df = pd.read_csv(Path('./Resources/lending_data.csv'))

# View the DF
lending_data_df.head(10)

Unnamed: 0,loan_size,interest_rate,borrower_income,debt_to_income,num_of_accounts,derogatory_marks,total_debt,loan_status
0,10700.0,7.672,52800,0.431818,5,1,22800,0
1,8400.0,6.692,43600,0.311927,3,0,13600,0
2,9000.0,6.963,46100,0.349241,3,0,16100,0
3,10700.0,7.664,52700,0.43074,5,1,22700,0
4,10800.0,7.698,53000,0.433962,5,1,23000,0
5,10100.0,7.438,50600,0.407115,4,1,20600,0
6,10300.0,7.49,51100,0.412916,4,1,21100,0
7,8800.0,6.857,45100,0.334812,3,0,15100,0
8,9300.0,7.096,47400,0.367089,3,0,17400,0
9,9700.0,7.248,48800,0.385246,4,0,18800,0


In [3]:
# Split the DataFrame into the target (y) and the features (X)
y = lending_data_df['loan_status']
X = lending_data_df.drop(columns='loan_status')

# Human-readable names for the target classes 0 and 1, in that order
target_names = ["Good Standing", "Default"]

In [4]:
# Split the data into X_train, X_test, y_train, y_test
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=10)

In [5]:
# Check the target values
y.value_counts()

0    75036
1     2500
Name: loan_status, dtype: int64
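
Because the two classes are heavily imbalanced, plain accuracy has a high floor. As a rough sketch that uses only the counts printed above (the variable names are illustrative and not part of the notebook), a trivial model that always predicts class 0 would already score about 96.8% on the full dataset:

```python
# Class counts copied from the value_counts() output above
good_standing = 75036  # loan_status == 0
default = 2500         # loan_status == 1

total = good_standing + default

# Share of loans in the minority (default) class
default_share = default / total

# Accuracy of a trivial model that always predicts "Good Standing"
majority_baseline = good_standing / total

print(f"Default share: {default_share:.4f}")
print(f"Majority-class baseline accuracy: {majority_baseline:.4f}")
```

Any accuracy reported further down is best read against this baseline rather than against 50%.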

### #1 Logistic Regression Model

In [6]:
# Import the LogisticRegression class from scikit-learn and create the classifier
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression(random_state=10, max_iter=10000)
classifier

LogisticRegression(max_iter=10000, random_state=10)

In [7]:
# Fit (train) our model by using the training data
classifier.fit(X_train, y_train)

LogisticRegression(max_iter=10000, random_state=10)

In [8]:
# Validate the model by using the test data
print(f"Training Data Score: {classifier.score(X_train, y_train)}")
print(f"Testing Data Score: {classifier.score(X_test, y_test)}")

Training Data Score: 0.9920724996560737
Testing Data Score: 0.9920037144036319


In [9]:
# Create the confusion matrix (confusion_matrix was imported in cell 1).
# With labels=[1, 0], rows and columns are ordered Default first:
# [[TP, FN],
#  [FP, TN]]
y_true = y_test
y_pred = classifier.predict(X_test)
confusion_matrix(y_true, y_pred, labels=[1,0])

array([[  591,    52],
       [  103, 18638]], dtype=int64)

In [10]:
# Unpack the confusion matrix (default label order [0, 1]) into its four counts
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"True positives (TP): {tp} | True negatives (TN): {tn}")
print(f"False positives (FP): {fp} | False negatives (FN): {fn}")

True positives (TP): 591 | True negatives (TN): 18638
False positives (FP): 103 | False negatives (FN): 52


In [11]:
accuracy = (tp + tn) / (tp + fp + tn + fn)
print(f"Accuracy: {accuracy}")

Accuracy: 0.9920037144036319
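
With defaults making up only about 3% of the loans, accuracy alone says little about how well the model catches them. The sketch below derives precision, recall, and F1 for the "Default" class from the four counts printed above. The `safe_div` helper is a hypothetical convenience added for illustration, not something the notebook defines.

```python
# Counts copied from the logistic regression confusion matrix above
tp, tn, fp, fn = 591, 18638, 103, 52


def safe_div(num, den):
    """Return num / den, or 0.0 when the denominator is zero."""
    return num / den if den else 0.0


# Of the loans predicted as Default, how many really were defaults?
precision = safe_div(tp, tp + fp)

# Of the loans that really were defaults, how many did the model catch?
recall = safe_div(tp, tp + fn)

# Harmonic mean of precision and recall
f1 = safe_div(2 * precision * recall, precision + recall)

# Same quantity the notebook computes in the previous cell
accuracy = safe_div(tp + tn, tp + tn + fp + fn)

print(f"Precision: {precision:.4f}")
print(f"Recall:    {recall:.4f}")
print(f"F1 score:  {f1:.4f}")
print(f"Accuracy:  {accuracy:.4f}")
```

The same figures should be obtainable by calling `classification_report(y_test, y_pred, target_names=target_names)` on the logistic regression predictions.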


### #2 Random Forest Classifier

In [12]:
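# Re-split the data (same random_state as before) and standardize the features.
# The scaler is fit on the training set only, then applied to both sets.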
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=10)
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

In [13]:
# Import the RandomForestClassifier class from scikit-learn
from sklearn.ensemble import RandomForestClassifier

In [20]:
# Fit a model, and then print a classification report
clf = RandomForestClassifier(random_state=10).fit(X_train_scaled, y_train)
y_pred = clf.predict(X_test_scaled)
print(classification_report(y_test, y_pred, target_names=target_names))
print(f'Training Score: {clf.score(X_train_scaled, y_train)}')
print(f'Testing Score: {clf.score(X_test_scaled, y_test)}')

               precision    recall  f1-score   support

Good Standing       1.00      0.99      1.00     18741
      Default       0.85      0.91      0.88       643

     accuracy                           0.99     19384
    macro avg       0.92      0.95      0.94     19384
 weighted avg       0.99      0.99      0.99     19384

Training Score: 0.9973689640940983
Testing Score: 0.9915910028889806


### Observations:

#### Logistic Regression Scores (Unscaled)
Training Data Score: 0.9920724996560737
<br>
Testing Data Score: 0.9920037144036319

#### Random Forest Classifier (Scaled)
Training Score: 0.9973689640940983
<br>
Testing Score: 0.9915910028889806

### Model Comparison Reflection
The training and testing scores of the **logistic regression** (unscaled) model are closer together, differing by only about 0.00007, than those of the **random forest classifier** (scaled) model, which differ by about 0.0058. The testing scores of the two models are nearly identical (0.9920 vs. 0.9916), which suggests that either model can be used to predict the risk level of loans. Between the two, the logistic regression model appears to be the better choice: it has the slightly higher testing score and the smaller gap between its training and testing scores.
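
The gaps quoted above can be reproduced directly from the scores listed under Observations. This is a small sketch for checking the arithmetic; the `scores` and `gaps` dictionaries are illustrative names, not objects from the notebook.

```python
# (training score, testing score) pairs copied from the Observations section
scores = {
    "Logistic Regression (unscaled)": (0.9920724996560737, 0.9920037144036319),
    "Random Forest (scaled)": (0.9973689640940983, 0.9915910028889806),
}

# Train-test gap per model: a larger gap hints at more overfitting
gaps = {name: train - test for name, (train, test) in scores.items()}

for name, gap in gaps.items():
    print(f"{name}: train-test gap = {gap:.5f}")

# Model with the highest testing score
best_on_test = max(scores, key=lambda name: scores[name][1])
print(f"Highest testing score: {best_on_test}")
```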
