# Credit Risk Evaluator

In [1]:
import numpy as np
import pandas as pd
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

## Retrieve the Data



In [2]:
# Import the data

df = pd.read_csv('Resources/lending_data.csv')
df.head(5)

| Unnamed: 0 | loan_size | interest_rate | borrower_income | debt_to_income | num_of_accounts | derogatory_marks | total_debt | loan_status |
|---|---|---|---|---|---|---|---|---|
| 0 | 10700.0 | 7.672 | 52800 | 0.431818 | 5 | 1 | 22800 | 0 |
| 1 | 8400.0 | 6.692 | 43600 | 0.311927 | 3 | 0 | 13600 | 0 |
| 2 | 9000.0 | 6.963 | 46100 | 0.349241 | 3 | 0 | 16100 | 0 |
| 3 | 10700.0 | 7.664 | 52700 | 0.43074 | 5 | 1 | 22700 | 0 |
| 4 | 10800.0 | 7.698 | 53000 | 0.433962 | 5 | 1 | 23000 | 0 |


## Predict Model Performance


After looking over the CSV, I can see that the data is all numeric, and there appears to be a linear relationship between the features (particularly interest rate) and the loan status. A Logistic Regression model tends to perform better when the features are numeric and relate roughly linearly to the outcome, while a Random Forest model tends to fit better to categorical data. This leads me to believe the Logistic Regression model will perform better.
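
One way to check the impression of a linear relationship before modelling is to look at how strongly each feature correlates with the target. The sketch below is only an illustration: the helper name `feature_target_correlation` and the synthetic `demo` frame are assumptions made here so the example runs on its own, and the values are invented rather than taken from `lending_data.csv`.

```python
import numpy as np
import pandas as pd


def feature_target_correlation(df: pd.DataFrame, target: str) -> pd.Series:
    """Pearson correlation of each numeric feature with the target, strongest first."""
    corr = df.select_dtypes("number").corr()[target].drop(target)
    # Order by absolute strength so the most related features come first
    return corr.reindex(corr.abs().sort_values(ascending=False).index)


# Synthetic stand-in for the lending data (invented values, for illustration only)
rng = np.random.default_rng(0)
interest_rate = rng.normal(7.5, 1.0, 500)
demo = pd.DataFrame({
    "interest_rate": interest_rate,
    "noise": rng.normal(0.0, 1.0, 500),
    "loan_status": (interest_rate > 9.0).astype(int),
})

ranking = feature_target_correlation(demo, "loan_status")
print(ranking)
```

On the real data, the same helper could be called as `feature_target_correlation(df, 'loan_status')`. A high correlation only hints at a linear relationship; it does not guarantee which model will score better.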

## Split the Data into Training and Testing Sets

In [3]:
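# Define the X (features) and y (target) sets
# Note: X keeps the 'Unnamed: 0' column (the CSV's row index) as a feature;
# dropping it as well may change the scores reported below.

X = df.drop('loan_status', axis=1)
y = df['loan_status'].values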

# Split the data into X_train, X_test, y_train, y_test

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

In [4]:
# Standardise the data (the scaler is fit on the training set only)

scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
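
The scaler above is fit on the training set and then reused on the test set, which keeps test-set statistics out of the fit. A possible alternative, sketched here under assumptions, is to bundle the scaler and the model in a scikit-learn `Pipeline` so the same rule is applied automatically. The data comes from `make_classification` purely so the snippet runs on its own; it is not the lending data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with 7 numeric features (assumption for illustration)
X_demo, y_demo = make_classification(n_samples=500, n_features=7, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=1)

# fit() fits the scaler on the training data only, then fits the model on the scaled data
pipe = make_pipeline(StandardScaler(), LogisticRegression())
pipe.fit(X_tr, y_tr)

# After standardisation, each training feature has a mean of (approximately) zero
scaled_means = pipe.named_steps["standardscaler"].transform(X_tr).mean(axis=0)
print(np.round(scaled_means, 6))

# score() scales the test data with the training statistics before predicting
test_score = pipe.score(X_te, y_te)
print(f"Testing Score: {test_score}")
```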

## Create, Fit and Compare Models


In [5]:
# Train a Logistic Regression model and print the model score

clf = LogisticRegression().fit(X_train_scaled, y_train)
print(f'Training Score: {clf.score(X_train_scaled, y_train)}')
print(f'Testing Score: {clf.score(X_test_scaled, y_test)}')

Training Score: 0.9942908240473243
Testing Score: 0.9936545604622369


In [6]:
# Train a Random Forest Classifier model and print the model score

clf = RandomForestClassifier(random_state=1, n_estimators=50).fit(X_train_scaled, y_train)
print(f'Training Score: {clf.score(X_train_scaled, y_train)}')
print(f'Testing Score: {clf.score(X_test_scaled, y_test)}')

Training Score: 0.9974893382858715
Testing Score: 0.9910751134956666


Both models performed similarly. The Logistic Regression model had a lower training score (0.9943 vs 0.9975) but was more accurate on the testing data (0.9937 vs 0.9911), whereas the Random Forest model had a better training score but a less accurate testing score, which may indicate a small amount of overfitting.
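
The scores above are plain accuracy, which can look very high even when one class is predicted poorly, especially if that class is rare. The `confusion_matrix` and `classification_report` functions imported at the top of the notebook give a per-class view. The sketch below uses an imbalanced synthetic dataset (the 95/5 class split is an assumption made for illustration, not a property measured on the lending data).

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Imbalanced synthetic stand-in data (assumed 95/5 split, for illustration only)
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=7, weights=[0.95, 0.05], random_state=1
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, random_state=1, stratify=y_demo
)

# Same scale-then-fit steps as in the cells above
scaler = StandardScaler().fit(X_tr)
model = LogisticRegression().fit(scaler.transform(X_tr), y_tr)
y_pred = model.predict(scaler.transform(X_te))

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_te, y_pred)
print(cm)

# Precision, recall and F1 for each class
print(classification_report(y_te, y_pred, zero_division=0))
```

In the notebook itself, the equivalent calls would take `y_test` and `clf.predict(X_test_scaled)` for each fitted model.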

In my initial prediction, I thought that the Logistic Regression model would outperform the Random Forest Classifier model, which it did on the testing data. What I did not expect was for there to be so little difference between the two.

I believe that with more time and different hyperparameters, I may be able to improve both models, which could lead to the Random Forest Classifier being more accurate in the end.
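
A minimal sketch of what such a hyperparameter search could look like, using scikit-learn's `GridSearchCV`. The parameter grid, the 3-fold setting, and the synthetic data are all assumptions chosen to keep the example small; they are not tuned values for the lending data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data (assumption for illustration)
X_demo, y_demo = make_classification(n_samples=300, n_features=7, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=1)

# Hypothetical grid: every combination is evaluated with 3-fold cross-validation
param_grid = {"n_estimators": [25, 50], "max_depth": [3, None]}
search = GridSearchCV(RandomForestClassifier(random_state=1), param_grid, cv=3)
search.fit(X_tr, y_tr)

# Best combination found on the training folds, then its score on held-out data
best_params = search.best_params_
tuned_score = search.score(X_te, y_te)
print(best_params)
print(f"Testing Score: {tuned_score}")
```

Because the search only sees the training data, the final testing score remains a fair comparison against the untuned models.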
