In [1]:
import numpy as np
import pandas as pd
from pathlib import Path
from sklearn.model_selection import train_test_split

In [2]:
# Import the data
df = pd.read_csv('Resources/lending_data.csv')
df.head()

Unnamed: 0,loan_size,interest_rate,borrower_income,debt_to_income,num_of_accounts,derogatory_marks,total_debt,loan_status
0,10700.0,7.672,52800,0.431818,5,1,22800,0
1,8400.0,6.692,43600,0.311927,3,0,13600,0
2,9000.0,6.963,46100,0.349241,3,0,16100,0
3,10700.0,7.664,52700,0.43074,5,1,22700,0
4,10800.0,7.698,53000,0.433962,5,1,23000,0


In [6]:
# Split the data into X_train, X_test, y_train, y_test
from sklearn.preprocessing import StandardScaler

X = df.drop(['loan_status'], axis=1)
y = df['loan_status'].values

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Fit the scaler on the training data only, then apply it to both sets
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

X.sample(5)

Unnamed: 0,loan_size,interest_rate,borrower_income,debt_to_income,num_of_accounts,derogatory_marks,total_debt
14811,9400.0,7.137,47800,0.372385,3,0,17800
51853,8900.0,6.923,45700,0.343545,3,0,15700
67298,7200.0,6.187,38800,0.226804,1,0,8800
64017,11300.0,7.948,55400,0.458484,5,1,25400
53406,8600.0,6.792,44500,0.325843,3,0,14500


I predict that the Logistic Regression model will perform better. My reasoning is that the Random Forest Classifier (RFC) model tends to do comparatively better when many of the features are categorical, whereas our data consists of numeric features. In addition, because logistic regression is a linear model, it can extrapolate to test observations whose numerical feature values fall outside the range of the training data, which a tree-based model cannot do. This could be the case with our dataset, as the loan_size, borrower_income and total_debt features can differ widely between observations.
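The extrapolation point above can be illustrated with a small sketch. It uses synthetic single-feature data, not the lending data, and the names `x_demo`, `y_demo`, `lr_demo` and `rf_demo` are made up for this illustration. It compares each model's predicted probability at the largest training value with its prediction far outside the training range.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic data (an assumption for illustration): one feature in [0, 10],
# where the chance of class 1 rises smoothly with the feature value.
x_demo = rng.uniform(0, 10, size=2000).reshape(-1, 1)
p_true = 1 / (1 + np.exp(-0.5 * (x_demo[:, 0] - 5)))
y_demo = (rng.uniform(size=2000) < p_true).astype(int)

lr_demo = LogisticRegression().fit(x_demo, y_demo)
rf_demo = RandomForestClassifier(n_estimators=50, random_state=0).fit(x_demo, y_demo)

x_edge = np.array([[x_demo.max()]])  # largest value seen during training
x_far = np.array([[30.0]])           # well outside the training range

# Probability of class 1 at the edge of the training range and far beyond it
lr_edge = lr_demo.predict_proba(x_edge)[0, 1]
lr_far = lr_demo.predict_proba(x_far)[0, 1]
rf_edge = rf_demo.predict_proba(x_edge)[0, 1]
rf_far = rf_demo.predict_proba(x_far)[0, 1]

print(f"Logistic Regression: edge={lr_edge:.4f}, far={lr_far:.4f}")
print(f"Random Forest:       edge={rf_edge:.4f}, far={rf_far:.4f}")
```

The forest returns the same probability for both points, because every tree routes any value above its largest split threshold to the same leaf. The logistic regression probability keeps following the fitted trend. Whether this matters for the lending data depends on how many test observations actually lie outside the training range.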

In [10]:
# Train a Logistic Regression model and print the model score
from sklearn.linear_model import LogisticRegression

log = LogisticRegression(max_iter=10000)
log.fit(X_train_scaled, y_train)

print(f"Training Data Score: {log.score(X_train_scaled, y_train)}")
print(f"Testing Data Score: {log.score(X_test_scaled, y_test)}")

Training Data Score: 0.9941188609162196
Testing Data Score: 0.9941704498555509


In [8]:
# Train a Random Forest Classifier model and print the model score
from sklearn.ensemble import RandomForestClassifier

ran = RandomForestClassifier(random_state=42)
ran.fit(X_train_scaled, y_train)

print(f"Training Data Score: {ran.score(X_train_scaled, y_train)}")
print(f"Testing Data Score: {ran.score(X_test_scaled, y_test)}")

Training Data Score: 0.9971798046498831
Testing Data Score: 0.9917457697069748


Both models performed well, but the Logistic Regression model slightly outperformed the Random Forest Classifier (RFC) model on the test data (0.9942 versus 0.9917). Although the RFC model had the highest training score (0.9972), the gap between its training and test scores was larger than for the Logistic Regression model, whose two scores were nearly identical. This may be due to test observations whose numerical feature values fall outside the range seen in the training data, although with a random split a gap like this can also simply indicate that the RFC model fit the training data more closely (mild overfitting). The models therefore performed in line with my prediction.
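One way to check whether test values really fall outside the training range is to measure it directly. The sketch below is a minimal, hypothetical helper (`share_outside_training_range` is not part of the notebook or of any library), shown on two tiny made-up frames rather than the lending data.

```python
import pandas as pd


def share_outside_training_range(X_train: pd.DataFrame, X_test: pd.DataFrame) -> pd.Series:
    """Return, per feature, the fraction of test rows outside [train min, train max]."""
    lo, hi = X_train.min(), X_train.max()
    # Compare each test column against the matching training minimum and maximum
    outside = X_test.lt(lo, axis=1) | X_test.gt(hi, axis=1)
    return outside.mean()


# Made-up values for illustration only
train_demo = pd.DataFrame({"loan_size": [8000.0, 9000.0, 11000.0],
                           "total_debt": [10000, 15000, 25000]})
test_demo = pd.DataFrame({"loan_size": [8500.0, 12000.0],
                          "total_debt": [12000, 20000]})

result = share_outside_training_range(train_demo, test_demo)
print(result)
```

In the notebook, the same helper could be called as `share_outside_training_range(X_train, X_test)` on the unscaled frames. Fractions close to zero would suggest that out-of-range test values are not the main reason for the RFC model's larger gap between training and test scores.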