In [1]:
import numpy as np
import pandas as pd
from pathlib import Path
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

In [2]:
# Import the data
filepath = 'Resources/lending_data.csv'
df = pd.read_csv(filepath)
df.head()

   loan_size  interest_rate  borrower_income  debt_to_income  num_of_accounts  derogatory_marks  total_debt  loan_status
0    10700.0          7.672            52800        0.431818                5                 1       22800            0
1     8400.0          6.692            43600        0.311927                3                 0       13600            0
2     9000.0          6.963            46100        0.349241                3                 0       16100            0
3    10700.0          7.664            52700        0.430740                5                 1       22700            0
4    10800.0          7.698            53000        0.433962                5                 1       23000            0


## Logistic Regression vs Random Forest Classifier

Logistic Regression and the Random Forest Classifier are two common classification algorithms. Random Forest performs well when the numerical feature values in the test data fall within or close to the range seen in the training data, because its trees can only split on thresholds learned from that range. Logistic Regression, on the other hand, fits a linear decision boundary, so it can still perform well when the test features fall well outside the training range, provided the linear relationship continues to hold there. Both algorithms are reasonable candidates for the loan risk problem, and which one is more adequate depends on the specifics of the dataset. My prediction is that loan_status is linearly related to one or more of the features, so Logistic Regression will perform best.
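To make the range argument concrete, here is a minimal sketch on synthetic one-feature data (not the lending data). All names in it (`x_demo`, `logit_demo`, `forest_demo`, and so on) are made up for illustration. It compares predicted probabilities at two points outside the training range of 0 to 10: the forest returns the same probability for both, while the logistic model's probability keeps changing.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

# Synthetic data (an assumption for illustration): one feature in [0, 10],
# with the probability of class 1 rising smoothly as the feature grows.
rng = np.random.default_rng(0)
x_demo = rng.uniform(0, 10, size=500)
p_one = 1 / (1 + np.exp(-0.5 * (x_demo - 5)))
y_demo = (rng.uniform(size=500) < p_one).astype(int)
X_demo = x_demo.reshape(-1, 1)

logit_demo = LogisticRegression().fit(X_demo, y_demo)
forest_demo = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_demo, y_demo)

# Two points beyond the largest training value
X_outside = np.array([[12.0], [15.0]])
logit_proba = logit_demo.predict_proba(X_outside)[:, 1]
forest_proba = forest_demo.predict_proba(X_outside)[:, 1]

print(f'Logistic Regression P(1): {logit_proba}')
print(f'Random Forest P(1):       {forest_proba}')
```

Every split threshold in the forest lies inside the training range, so both outside points land in the same leaves and get identical probabilities. Whether that matters in practice depends on whether the test data actually leaves the training range.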

In [3]:
# Count how many loans fall into each loan_status class
# (open question: does 1 = late and 0 = on time?)

zero = (df['loan_status'] == 0).sum()  # rows where loan_status is 0
one = len(df) - zero                   # every other row
print(f'One: {one}, Zero: {zero}')

One: 2500, Zero: 75036
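
The split is heavily imbalanced, which affects how accuracy scores should be read. As a rough sketch, the counts printed above imply the accuracy that a model would reach by always predicting 0. The variable names are illustrative only.

```python
# Class counts copied from the cell output above
zero, one = 75036, 2500

# Accuracy of a trivial model that always predicts the majority class (0)
baseline_accuracy = zero / (zero + one)
print(f'Majority-class baseline accuracy: {baseline_accuracy:.4f}')
```

This comes out at about 96.8%, so any accuracy score below should be compared against that floor rather than against 50%.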


In [5]:
#checking for nulls
df.isnull().sum()

loan_size           0
interest_rate       0
borrower_income     0
debt_to_income      0
num_of_accounts     0
derogatory_marks    0
total_debt          0
loan_status         0
dtype: int64

In [6]:
# Split df to X and y
X = df.drop('loan_status', axis=1)
y = df['loan_status']

In [7]:
# Split the data into X_train, X_test, y_train, y_test
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
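
With classes this imbalanced, one optional variation is to pass `stratify` to `train_test_split` so that both splits keep roughly the same class ratio. The sketch below uses made-up data (`X_strat`, `y_strat`) rather than the lending data, and it is not what the cell above does.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up imbalanced labels for illustration: 970 zeros and 30 ones
y_strat = np.array([0] * 970 + [1] * 30)
X_strat = np.arange(1000).reshape(-1, 1)

# stratify=y_strat keeps the share of ones similar in train and test
X_tr, X_te, y_tr, y_te = train_test_split(
    X_strat, y_strat, random_state=0, stratify=y_strat
)

print(f'Share of ones in train: {y_tr.mean():.3f}')
print(f'Share of ones in test:  {y_te.mean():.3f}')
```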

In [8]:
# scaling data
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

In [12]:
# Train a Logistic Regression model and print the model score
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression().fit(X_train_scaled, y_train)
print(f'Training Score: {clf.score(X_train_scaled, y_train)}')
print(f'Test Score: {clf.score(X_test_scaled, y_test)}')

Training Score: 0.9941876461686614
Test Score: 0.9939640940982254


In [13]:
# Train a Random Forest Classifier model and print the model score
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(random_state=1, n_estimators=500)
clf.fit(X_train_scaled, y_train)

print(f'Training Score: {clf.score(X_train_scaled, y_train)}')
print(f'Test Score: {clf.score(X_test_scaled, y_test)}')


Training Score: 0.9973861604072087
Test Score: 0.991384647131655


## Results

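Both models performed almost equally well at predicting loan risk (Logistic Regression test accuracy: 99.4%; Random Forest Classifier test accuracy: 99.1%). I predicted that Logistic Regression would perform better. While this was technically correct, the gap is only about 0.3 percentage points, so I think the broad takeaway is that both approaches are suitable. The Random Forest scored slightly higher on the training data (99.7%) than on the test data, which may point to mild overfitting.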
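
Accuracy alone can hide how a model does on the rare class. The notebook imports `classification_report` but never calls it, so here is a minimal sketch of how it could be used. It runs on synthetic imbalanced data from `make_classification`, not the lending data, and the names `X_rep`, `y_rep`, and `model_rep` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption): about 3% of rows are class 1
X_rep, y_rep = make_classification(
    n_samples=5000, n_features=7, n_informative=4,
    weights=[0.97, 0.03], random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_rep, y_rep, random_state=0, stratify=y_rep
)

model_rep = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# Per-class precision, recall, and F1 instead of a single accuracy number
report = classification_report(
    y_te, model_rep.predict(X_te),
    target_names=['loan_status 0', 'loan_status 1'],
    zero_division=0,
)
print(report)
```

Applied to this notebook, the same call would take `y_test` and the predictions from each fitted model on `X_test_scaled`. The recall for class 1 would show how many of the minority-class loans each model actually catches.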