In [58]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pathlib import Path
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.preprocessing import StandardScaler, LabelEncoder

In [2]:
train_df = pd.read_csv(Path('Resources/2019loans.csv'))
test_df = pd.read_csv(Path('Resources/2020Q1loans.csv'))

## Initial Predictions
There are several considerations regarding the data.  It is an imbalanced data set with more than 10 features, which in theory a Random Forest Classifier should handle best.  However, a quick, informed decision is usually desirable in a financial setting, and speed is a strong point of Logistic Regression.

---

Based on these considerations, I predict that random forest will provide the best results because of the size of the data set and the number of features, which would be too noisy for logistic regression.
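
The imbalance assumption can be checked before modeling by looking at the share of each class in the target column. The sketch below is a minimal illustration on toy labels; the helper `class_balance` and the label values `low_risk` / `high_risk` are assumptions made for the example, not taken from the loan files.

```python
import pandas as pd

def class_balance(labels: pd.Series) -> pd.Series:
    """Return the share of rows belonging to each class, largest first."""
    return labels.value_counts(normalize=True)

# Toy labels standing in for train_df['loan_status'] (values are illustrative only)
toy_status = pd.Series(['low_risk'] * 8 + ['high_risk'] * 2)
balance = class_balance(toy_status)
print(balance)
```

On the real data the same helper could be called as `class_balance(train_df['loan_status'])`; shares far from an even split would support the imbalance assumption.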

In [27]:
# Convert categorical data to numeric and separate target feature for training data
X_train = train_df.drop('loan_status', axis=1)

X_train_dummies = pd.get_dummies(X_train)


In [28]:
# Convert categorical data to numeric and separate target feature for testing data
X_test = test_df.drop('loan_status', axis=1)

X_test_dummies = pd.get_dummies(X_test)


In [29]:
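# Fit one encoder on the training labels and reuse it for the test labels,
# so both sets share the same class-to-integer mapping
label_encoder = LabelEncoder().fit(train_df['loan_status'])
y_label_train = label_encoder.transform(train_df['loan_status'])
y_label_test = label_encoder.transform(test_df['loan_status'])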

In [31]:
# Align the testing set with the training columns: dummy columns missing
# from the test data are added as 0 and the column order is matched
X_test_dummies = X_test_dummies.reindex(columns=X_train_dummies.columns, fill_value=0)
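
The alignment step above can be tried on a small example. The sketch below uses two toy frames (the column names `amount` and `flag` are hypothetical) in which one category never appears in the test data, and aligns the test dummies to the training columns with `DataFrame.reindex`.

```python
import pandas as pd

# Toy frames standing in for X_train / X_test (column names are hypothetical)
toy_train = pd.DataFrame({'amount': [100, 200, 300], 'flag': ['N', 'Y', 'N']})
toy_test = pd.DataFrame({'amount': [150, 250], 'flag': ['N', 'N']})

train_dummies = pd.get_dummies(toy_train)
test_dummies = pd.get_dummies(toy_test)

# Reindex the test frame to the training columns: missing columns are
# created and filled with 0, extra columns are dropped, order is matched.
test_aligned = test_dummies.reindex(columns=train_dummies.columns, fill_value=0)

print(list(train_dummies.columns))
print(list(test_aligned.columns))
```

Matching the column order as well as the column set matters because scikit-learn estimators treat the feature matrix positionally.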

In [71]:
# Train the Logistic Regression model on the unscaled data and print the model score
clf_logistic = LogisticRegression(solver='sag', max_iter=1000).fit(X_train_dummies, y_label_train)

print(f"LR (unscaled) Training Data Score: {clf_logistic.score(X_train_dummies, y_label_train)}")
print(f"LR (unscaled) Testing Data Score: {clf_logistic.score(X_test_dummies, y_label_test)}")

LR (unscaled) Training Data Score: 0.6286535303776684
LR (unscaled) Testing Data Score: 0.5325393449595917




In [61]:
# Grid-search n_estimators and max_features for the Random Forest Classifier
# ('auto' is omitted: for classifiers it was an alias of 'sqrt' and has been
# removed from recent scikit-learn releases)
rfc = RandomForestClassifier(n_jobs=-1, max_features='sqrt', n_estimators=50, oob_score=True)

param_grid = {
    'n_estimators': [200, 700],
    'max_features': ['sqrt', 'log2']
}

CV_rfc = GridSearchCV(estimator=rfc, param_grid=param_grid, cv=5)
CV_rfc.fit(X_train_dummies, y_label_train)
print(CV_rfc.best_params_)

{'max_features': 'log2', 'n_estimators': 700}


In [72]:
# Train a Random Forest Classifier model and print the model score
clf_forest = RandomForestClassifier(random_state=100, n_estimators=700, max_features='log2').fit(X_train_dummies, y_label_train)

print(f'RF (unscaled) Training Score: {clf_forest.score(X_train_dummies, y_label_train)}')
print(f'RF (unscaled) Testing Score: {clf_forest.score(X_test_dummies, y_label_test)}')

RF (unscaled) Training Score: 1.0
RF (unscaled) Testing Score: 0.6095278604849


## Initial Prediction Analysis and Scaled Predictions
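As suspected, RF outperformed LR on the test set, but only modestly (0.610 vs. 0.533).  The largest difference is between the training scores: RF scored 1.0 on the training data against 0.629 for LR, a gap between training and testing that points to the forest overfitting.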

---

Scaling the data should have some impact on LR's modeling capabilities, because of the number of features and their differing ranges.  This will bring features that are on very different scales into the same ballpark, so to speak.

I still predict that RF will outperform LR because RF handles noisy data well already, and reducing that noise will only boost its performance.
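
To make the "same ballpark" idea concrete, the sketch below standardizes two toy features whose magnitudes differ by several orders. The values are illustrative and not taken from the loan data; the point is only that the scaler is fit on the training rows and the same shift and scale are then applied to both sets.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy features on very different scales (e.g. a dollar amount and a rate);
# the values are illustrative, not taken from the loan data.
toy_train = np.array([[10000.0, 0.05],
                      [25000.0, 0.12],
                      [40000.0, 0.19]])
toy_test = np.array([[30000.0, 0.10]])

# Fit on the training rows only, then apply the same shift and scale to both sets
toy_scaler = StandardScaler().fit(toy_train)
toy_train_scaled = toy_scaler.transform(toy_train)
toy_test_scaled = toy_scaler.transform(toy_test)

print(toy_train_scaled.mean(axis=0))  # approximately 0 for each column
print(toy_train_scaled.std(axis=0))   # approximately 1 for each column
print(toy_test_scaled)
```

After the transform, each training column has mean 0 and unit variance, so no single feature dominates a gradient-based solver simply because of its units.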

In [63]:
# Scale the data
scaler = StandardScaler().fit(X_train_dummies)
X_train_scaled = scaler.transform(X_train_dummies)
X_test_scaled = scaler.transform(X_test_dummies)

In [67]:
# Train the Logistic Regression model on the scaled data and print the model score
clf_lr_scaled = LogisticRegression(max_iter=300).fit(X_train_scaled, y_label_train)


print(f'Training Score: {clf_lr_scaled.score(X_train_scaled, y_label_train)}')
print(f'Testing Score: {clf_lr_scaled.score(X_test_scaled, y_label_test)}')

Training Score: 0.7128899835796387
Testing Score: 0.7201190982560612


In [68]:
# Grid-search n_estimators and max_features for the Random Forest Classifier on the scaled data
# ('auto' is omitted: for classifiers it was an alias of 'sqrt' and has been
# removed from recent scikit-learn releases)
rfc2 = RandomForestClassifier(n_jobs=-1, max_features='sqrt', n_estimators=50, oob_score=True)

param_grid = {
    'n_estimators': [200, 700],
    'max_features': ['sqrt', 'log2']
}

CV_rfc = GridSearchCV(estimator=rfc2, param_grid=param_grid, cv=5)
CV_rfc.fit(X_train_scaled, y_label_train)
print(CV_rfc.best_params_)

{'max_features': 'log2', 'n_estimators': 200}


In [70]:
# Train a Random Forest Classifier model on the scaled data and print the model score
clf_rf_scaled = RandomForestClassifier(random_state=100, n_estimators=200, max_features="log2").fit(X_train_scaled, y_label_train)

print(f'Training Score: {clf_rf_scaled.score(X_train_scaled, y_label_train)}')
print(f'Testing Score: {clf_rf_scaled.score(X_test_scaled, y_label_test)}')

Training Score: 1.0
Testing Score: 0.6052743513398554


## Final Thoughts
Surprisingly, LR produced the best final model once the data was scaled.  RF was mostly unchanged by scaling (test score 0.610 unscaled vs. 0.605 scaled), which matches the initial prediction that it handles noise well.  Where it departed from the prediction is that less noise did not translate into a better score.

---

In this scenario, I would use LR with a scaled data set, as it provided the highest accuracy with a test score of 0.72.  That means nearly 3 out of 4 loan requests are assigned the correct risk class, an acceptable measure that takes human emotion out of the decision on whether or not to give a loan.
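
An overall accuracy figure does not show how the correct and incorrect classifications are split between the two risk classes. As a minimal sketch, the toy labels and predictions below (illustrative only, not model output) show how a confusion matrix and per-class recall could be read alongside accuracy.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Toy encoded labels and predictions (illustrative only, not model output)
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1])
y_pred = np.array([0, 0, 0, 1, 1, 1, 0, 0])

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true, y_pred)

# Share of each true class that was predicted correctly
per_class_recall = cm.diagonal() / cm.sum(axis=1)

print(accuracy_score(y_true, y_pred))
print(cm)
print(per_class_recall)
```

With the notebook's own objects, the same calls could be applied to `y_label_test` and `clf_lr_scaled.predict(X_test_scaled)`.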
