# Classifying Loan Data 

### Summary
**Big Picture Summary:** This notebook develops a machine learning model for the conservative Lending Club investor. The data used here was cleaned in the [Data_Wrangling.ipynb](https://github.com/paulb17/Springboard/blob/master/Capstone%20Project%201/Data_Wrangling%20.ipynb) notebook and then explored in the [Data_Exploration.ipynb](https://github.com/paulb17/Springboard/blob/master/Capstone%20Project%201/Data_Exploration.ipynb) notebook.

...


## Importing the data

In [21]:
# importing relevant packages
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

# importing useful functions
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import cross_val_predict, KFold, train_test_split 
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.ensemble import RandomForestClassifier
# sklearn.externals.joblib has been removed from recent scikit-learn releases,
# so joblib is imported directly
from joblib import parallel_backend


# creating plots using seaborn setting 
sns.set()

# using jupyter magic to display plots in line
%matplotlib inline

# importing the dataset
loan_data = pd.read_csv('Wrangled_Loan_data.csv', low_memory=False)

## Machine Learning Models
Prior to creating machine learning models, it is important to choose a performance metric that will be used to compare them. Below we outline the metric chosen for comparing the models.

### Method of Comparison

Sensitivity and specificity will be used as metrics to determine how worthwhile the final model is to the conservative investor. For this problem, sensitivity is the number of loans that the model correctly predicts will be fully paid, as a percentage of the total number of loans that are fully paid. Specificity is the number of loans that the model correctly predicts will be charged off, as a percentage of the total number of loans that are charged off.

Since this problem is viewed from the standpoint of a conservative investor, false positives should be treated differently than false negatives. Conservative investors would want to minimize risk, and avoid false positives as much as possible: they would not mind missing out on opportunities (false negatives) as much as they would mind funding a risky loan (false positives). Consequently, the specificity metric will be more important than the sensitivity metric.

To identify the best model for the conservative investor, the machine learning algorithms will be compared using the AUROC (Area Under the Receiver Operating Characteristic curve).
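As a rough illustration of what the AUROC measures, the sketch below computes it as the fraction of (fully paid, charged off) pairs in which the fully paid loan receives the higher score, counting ties as half. The `auroc` helper, the labels and the scores are hypothetical and are not part of the notebook's data or code.

```python
def auroc(y_true, scores):
    """AUROC as the probability that a random positive outranks a random negative.

    Ties between a positive and a negative count as half a win.
    """
    positives = [s for y, s in zip(y_true, scores) if y == 1]
    negatives = [s for y, s in zip(y_true, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# hypothetical labels (1 = fully paid, 0 = charged off) and model scores
y_true = [1, 1, 1, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.6, 0.2]

# AUROC from the continuous scores
auroc_scores = auroc(y_true, scores)

# AUROC from hard labels obtained by thresholding the scores at 0.5;
# thresholding discards the ranking information within each predicted class
hard_labels = [1 if s >= 0.5 else 0 for s in scores]
auroc_labels = auroc(y_true, hard_labels)

print(auroc_scores, auroc_labels)
```

In this toy case the score-based value is higher than the label-based one, which suggests that an AUROC computed from hard class predictions can understate how well a model ranks loans.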

### Creating the baseline model
A logistic regression model using the default sklearn parameters will serve as a baseline model. 

In [3]:
# creating list of features
loan_features = loan_data.drop(['loan_status'], axis =1)
features = list(loan_features.columns)

# train test set split
X_train, X_test, y_train, y_test = train_test_split(loan_data[features], loan_data["loan_status"], 
                                                    train_size = 0.75, test_size = 0.25, 
                                                    random_state = 42)

# instantiating baseline logistic regression model
base_lr_model = LogisticRegression(solver = 'liblinear')

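# setting the number of folds (no shuffling, so random_state is not needed;
# recent scikit-learn versions raise an error if it is passed without shuffle=True)
kf = KFold(n_splits=10)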
    
# fitting the model and computing predictions
base_lr_model.fit(X_train, y_train)
base_prediction_train = cross_val_predict(base_lr_model, X_train, y_train, cv=kf)

# calculating the AUROC for the training set
# (note: computed from hard class predictions rather than predicted probabilities)
base_train_roc = roc_auc_score(y_train, base_prediction_train)
print('For the training set the ROCAUC is:' + str(base_train_roc)+ '\n' )

# predicting results on test set
base_prediction_test = base_lr_model.predict(X_test)

# calculating the AUROC for the test set
base_test_roc = roc_auc_score(y_test, base_prediction_test)
print('For the test set the ROCAUC is:' + str(base_test_roc))

For the training set the ROCAUC is:0.500978236157

For the test set the ROCAUC is:0.500299868073


An AUROC of about 0.5 indicates that the baseline model is unable to distinguish charged-off loans from fully paid loans. The confusion matrices for the training and test set predictions are shown below:

In [4]:
# creating confusion matrix for the training set 
cm = confusion_matrix(y_train, base_prediction_train)
print('For the training set the confusion matrix is:' + '\n' + str(cm))
       
# create confusion matrix for the test set
cm= confusion_matrix(y_test, base_prediction_test)
print('\n'*2 + 'For the test set' + ' the confusion matrix is:' + '\n'+ str(cm))


For the training set the confusion matrix is:
[[   10  4249]
 [   10 25533]]


For the test set the confusion matrix is:
[[   1 1394]
 [   1 8538]]
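The two matrices above can be turned into sensitivity and specificity by hand. The sketch below does this in plain Python, assuming the scikit-learn layout in which rows are actual classes and columns are predicted classes, so each matrix reads `[[TN, FP], [FN, TP]]` with class 1 (fully paid) as the positive class. The helper name `sensitivity_specificity` is introduced here for illustration only.

```python
def sensitivity_specificity(cm):
    """Return (sensitivity, specificity) from a 2x2 matrix [[TN, FP], [FN, TP]]."""
    (tn, fp), (fn, tp) = cm
    sensitivity = tp / (tp + fn)   # fully paid loans correctly identified
    specificity = tn / (tn + fp)   # charged-off loans correctly identified
    return sensitivity, specificity


# confusion matrices printed above (rows: actual 0/1, columns: predicted 0/1)
confusion_matrices = {
    "training": [[10, 4249], [10, 25533]],
    "test": [[1, 1394], [1, 8538]],
}

results = {}
for name, cm in confusion_matrices.items():
    sens, spec = sensitivity_specificity(cm)
    # mean of sensitivity and specificity (balanced accuracy)
    results[name] = (sens, spec, (sens + spec) / 2)
    print(f"{name}: sensitivity={sens:.4f}, specificity={spec:.4f}, "
          f"mean={results[name][2]:.4f}")
```

The means match the scores reported by the baseline cell (about 0.5010 and 0.5003), because an AUROC computed from hard 0/1 predictions reduces to the average of sensitivity and specificity. The specificity is close to zero in both cases, which is the quantity a conservative investor cares about most.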


We see that the baseline model predicts that almost all borrowers will fully pay off their loans. This results from the class imbalance in the dataset: there are about 6 times as many loans that were paid off on time (1) as loans that were charged off (0). To account for this, a grid search over different class weights will be carried out.

Below, we start by creating a function to implement this grid search.

### Creating a grid search function to fit and test models
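The function created below takes in a classification model and a parameter grid, tunes the hyperparameters using grid search with 10-fold cross-validation, and returns the cross-validation results.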

In [5]:
def classification_model(model, param_grid, score = 'roc_auc'):
    
    # setting the number of folds (unshuffled, so no random_state is required)
    kf = KFold(n_splits=10)
    
    # instantiating the GridSearchCV object
    model_cv = GridSearchCV(model, param_grid, cv = kf, scoring = score, 
                            return_train_score = True)
    
    # fitting the training set using joblib's threading backend
    with parallel_backend('threading'):
        model_cv.fit(X_train, y_train)

    # printing the best hyperparameters found
    print("Tuned Model Parameters: {}".format(model_cv.best_params_))
    
    # returning the cross-validation results as a DataFrame
    return pd.DataFrame(model_cv.cv_results_)


### Logistic Regression with varying class weights and regularization parameter
Below we tune the class weights to account for the class imbalance in the dataset. We also tune the regularization parameter C to get a better understanding of its effect on the model.

In [6]:
# candidate values for the inverse regularization strength C
c_space = np.logspace(-3, 3, 7)

# parameters for grid search 
param_grid = {'C': c_space, 'class_weight':[{0:5, 1:1}, {0:6, 1:1}, 'balanced', {0:7,1:1}, {0:8,1:1}]}

# calculating results
%time results = classification_model(base_lr_model, param_grid)  

Tuned Model Parameters: {'C': 100.0, 'class_weight': {0: 7, 1: 1}}
CPU times: user 12min 50s, sys: 24.2 s, total: 13min 14s
Wall time: 3min 47s


In [7]:
# columns of importance
result_columns = ['rank_test_score', 'params', 'mean_test_score', 'mean_train_score']
results[result_columns].sort_values('rank_test_score').head(10)

| | rank_test_score | params | mean_test_score | mean_train_score |
|---|---|---|---|---|
| 28 | 1 | {'C': 100.0, 'class_weight': {0: 7, 1: 1}} | 0.702363 | 0.708108 |
| 18 | 2 | {'C': 1.0, 'class_weight': {0: 7, 1: 1}} | 0.702309 | 0.708546 |
| 8 | 3 | {'C': 0.01, 'class_weight': {0: 7, 1: 1}} | 0.702177 | 0.707658 |
| 33 | 4 | {'C': 1000.0, 'class_weight': {0: 7, 1: 1}} | 0.701619 | 0.708253 |
| 13 | 5 | {'C': 0.1, 'class_weight': {0: 7, 1: 1}} | 0.701601 | 0.708452 |
| 23 | 6 | {'C': 10.0, 'class_weight': {0: 7, 1: 1}} | 0.701329 | 0.707336 |
| 22 | 7 | {'C': 10.0, 'class_weight': 'balanced'} | 0.701035 | 0.705778 |
| 11 | 8 | {'C': 0.1, 'class_weight': {0: 6, 1: 1}} | 0.700066 | 0.705039 |
| 32 | 9 | {'C': 1000.0, 'class_weight': 'balanced'} | 0.700043 | 0.705415 |
| 6 | 10 | {'C': 0.01, 'class_weight': {0: 6, 1: 1}} | 0.700005 | 0.70454 |


We see here that setting the penalty for misclassifying charged-off loans to 7 times the penalty for misclassifying fully paid loans gives the best AUROC values on average. In addition, we note that while varying the regularization parameter "C" has an effect on AUROC values, the differences it makes are small.
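For context on the `'balanced'` option in the grid, scikit-learn documents it as weighting each class by `n_samples / (n_classes * class_count)`. The sketch below applies that formula to the training-set class counts implied by the baseline confusion matrix (4259 charged-off and 25543 fully paid loans). The helper `balanced_class_weights` is a hypothetical name, and the counts inside each cross-validation fold will differ slightly from these totals.

```python
def balanced_class_weights(class_counts):
    """Weights following n_samples / (n_classes * count) for each class."""
    n_samples = sum(class_counts.values())
    n_classes = len(class_counts)
    return {label: n_samples / (n_classes * count)
            for label, count in class_counts.items()}


# training-set class counts taken from the rows of the baseline confusion matrix
train_counts = {0: 4259, 1: 25543}

weights = balanced_class_weights(train_counts)

# penalty for misclassifying a charged-off loan relative to a fully paid loan
relative_penalty = weights[0] / weights[1]

print(weights, relative_penalty)
```

Under these assumptions `'balanced'` corresponds to a relative penalty of roughly 6 to 1, so it sits between the explicit `{0: 5, 1: 1}` and `{0: 7, 1: 1}` settings in the grid.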

The best mean cross-validated AUROC (mean_test_score) belongs to the model with {'C': 100.0, 'class_weight': {0: 7, 1: 1}}, with a value of 0.702. Let's look into whether another model could further improve on this AUROC.

### Random Forest Classification
A random forest has many hyperparameters that may be worth tuning. To investigate this, we start by creating a function that randomly searches through a selected grid of random forest parameters. The parameters that appear to be the most important will then be tuned using grid search.

**Creating function for randomized searching**

In [17]:
def random_search(model, random_grid, score = 'roc_auc', iterations = 30):
    
    # setting the number of folds (unshuffled, so no random_state is required)
    kf = KFold(n_splits=10)
    
    # instantiating the RandomizedSearchCV object
    model_cv = RandomizedSearchCV(model, random_grid, n_iter = iterations, cv = kf, 
                                  scoring = score, return_train_score = True, n_jobs=-3)

    # fitting the training set 
    model_cv.fit(X_train, y_train)

    # printing the best hyperparameters found
    print("Tuned Model Parameters: {}".format(model_cv.best_params_))
    
    # returning the cross-validation results as a DataFrame
    return pd.DataFrame(model_cv.cv_results_)

In [23]:
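# creating the random forest model object
rf_model = RandomForestClassifier()

# parameter candidates for the randomized search
# (note: 'auto' was equivalent to 'sqrt' for classifiers and has been removed
# from recent scikit-learn releases)
random_grid = {'class_weight':[{0:6, 1:1}, 'balanced', {0:7,1:1}],
               'max_features': ['auto', 'sqrt'],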
               'max_depth': [50, 100, 200, None],
               'min_samples_split': [2, 5, 8],
               'min_samples_leaf': [1, 2, 5],
               'n_estimators': [200, 250],
               'bootstrap' : [True, False]}

# calculating results
%time random_rf_results = random_search(rf_model, random_grid, iterations = 20)  

Tuned Model Parameters: {'n_estimators': 200, 'min_samples_split': 2, 'min_samples_leaf': 5, 'max_features': 'auto', 'max_depth': 200, 'class_weight': 'balanced', 'bootstrap': True}
CPU times: user 12.2 s, sys: 736 ms, total: 12.9 s
Wall time: 28min 25s
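To put the run above in perspective, the sketch below counts how many parameter combinations the grid contains and how many model fits a randomized search with 20 iterations and 10 folds performs compared with an exhaustive grid search. The dictionary repeats the candidate values used above. The fit counts exclude the final refit on the full training set.

```python
# same candidate values as the random_grid dictionary used in the search
random_grid = {'class_weight': [{0: 6, 1: 1}, 'balanced', {0: 7, 1: 1}],
               'max_features': ['auto', 'sqrt'],
               'max_depth': [50, 100, 200, None],
               'min_samples_split': [2, 5, 8],
               'min_samples_leaf': [1, 2, 5],
               'n_estimators': [200, 250],
               'bootstrap': [True, False]}

# number of distinct combinations is the product of the candidate list lengths
n_combinations = 1
for values in random_grid.values():
    n_combinations *= len(values)

n_iterations = 20   # iterations passed to the randomized search
n_folds = 10        # folds used for cross-validation

randomized_fits = n_iterations * n_folds
exhaustive_fits = n_combinations * n_folds

print(n_combinations, randomized_fits, exhaustive_fits)
```

Sampling only a small fraction of the grid keeps the run time manageable, at the cost of possibly missing the best combination, which is why a narrower grid search is planned as a follow-up step.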


In [24]:
pd.options.display.max_colwidth = 300
random_rf_results[result_columns].sort_values('rank_test_score')

| | rank_test_score | params | mean_test_score | mean_train_score |
|---|---|---|---|---|
| 19 | 1 | {'n_estimators': 200, 'min_samples_split': 2, 'min_samples_leaf': 5, 'max_features': 'auto', 'max_depth': 200, 'class_weight': 'balanced', 'bootstrap': True} | 0.700272 | 0.996171 |
| 4 | 2 | {'n_estimators': 250, 'min_samples_split': 8, 'min_samples_leaf': 1, 'max_features': 'sqrt', 'max_depth': 100, 'class_weight': {0: 6, 1: 1}, 'bootstrap': True} | 0.700234 | 0.999999 |
| 6 | 3 | {'n_estimators': 250, 'min_samples_split': 8, 'min_samples_leaf': 5, 'max_features': 'sqrt', 'max_depth': None, 'class_weight': {0: 6, 1: 1}, 'bootstrap': False} | 0.699625 | 0.999998 |
| 7 | 4 | {'n_estimators': 250, 'min_samples_split': 5, 'min_samples_leaf': 5, 'max_features': 'sqrt', 'max_depth': 100, 'class_weight': {0: 6, 1: 1}, 'bootstrap': False} | 0.699134 | 0.999999 |
| 9 | 5 | {'n_estimators': 250, 'min_samples_split': 2, 'min_samples_leaf': 2, 'max_features': 'sqrt', 'max_depth': None, 'class_weight': 'balanced', 'bootstrap': True} | 0.698953 | 1.0 |
| 10 | 6 | {'n_estimators': 250, 'min_samples_split': 8, 'min_samples_leaf': 1, 'max_features': 'sqrt', 'max_depth': 200, 'class_weight': {0: 7, 1: 1}, 'bootstrap': True} | 0.697327 | 1.0 |
| 0 | 7 | {'n_estimators': 200, 'min_samples_split': 8, 'min_samples_leaf': 2, 'max_features': 'sqrt', 'max_depth': 200, 'class_weight': {0: 6, 1: 1}, 'bootstrap': False} | 0.697214 | 1.0 |
| 18 | 8 | {'n_estimators': 250, 'min_samples_split': 5, 'min_samples_leaf': 1, 'max_features': 'sqrt', 'max_depth': 100, 'class_weight': {0: 6, 1: 1}, 'bootstrap': True} | 0.697197 | 1.0 |
| 5 | 9 | {'n_estimators': 250, 'min_samples_split': 2, 'min_samples_leaf': 2, 'max_features': 'sqrt', 'max_depth': 100, 'class_weight': {0: 7, 1: 1}, 'bootstrap': True} | 0.696939 | 1.0 |
| 14 | 10 | {'n_estimators': 250, 'min_samples_split': 5, 'min_samples_leaf': 1, 'max_features': 'sqrt', 'max_depth': None, 'class_weight': {0: 7, 1: 1}, 'bootstrap': True} | 0.696299 | 1.0 |


---

In [78]:
# creating list of features
loan_features = loan_data.drop(['loan_status'], axis =1)
features = list(loan_features.columns)
log_features = loan_features.copy()
log_features['annual_inc'] = np.log(log_features['annual_inc'])

# train test set split on log features
logX_train, logX_test, logy_train, logy_test = train_test_split(log_features, loan_data["loan_status"], 
                                                    train_size = 0.75, test_size = 0.25, 
                                                    random_state = 42)

# instantiating logistic regression model with balanced class weights
log_lr_model = LogisticRegression(class_weight= 'balanced')

# setting the number of folds (unshuffled, so no random_state is required)
kf = KFold(n_splits=10)
    
# fitting the model and computing predictions
log_lr_model.fit(logX_train, logy_train)
log_prediction_train = cross_val_predict(log_lr_model, logX_train, logy_train, cv=kf)

# calculating the AUROC for the training set
log_train_roc = roc_auc_score(logy_train, log_prediction_train)
print('For the training set the ROCAUC is:' + str(log_train_roc)+ '\n' )

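# predicting results on the log-transformed test set
# (the recorded output below came from predicting on the untransformed X_test,
# which likely explains the test score of exactly 0.5)
log_prediction_test = log_lr_model.predict(logX_test)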

# calculating the AUROC for the test set
log_test_roc = roc_auc_score(logy_test, log_prediction_test)
print('For the test set the ROCAUC is:' + str(log_test_roc))

For the training set the ROCAUC is:0.65035295325

For the test set the ROCAUC is:0.5


In [None]:
# creating confusion matrix for the training set 
cm = confusion_matrix(y_train, base_prediction_train)
print('For the training set the confusion matrix is:' + '\n' + str(cm))

# calculating sensitivity for the training set: TP / (TP + FN)
sensitivity_train = cm[1,1]/(cm[1,1]+cm[1,0])
print('The sensitivity of the training set is: ' + str(sensitivity_train))

# calculating specificity for the training set: TN / (TN + FP)
specificity_train = cm[0,0]/(cm[0,0]+cm[0,1])
print('The specificity of the training set is: ' + str(specificity_train))

# predicting results on test set
prediction_test = base_lr_model.predict(X_test)
       
# create confusion matrix for the test set
cm= confusion_matrix(y_test, prediction_test)
print('\n'*2 + 'For the test set' + ' the confusion matrix is:' + '\n'+ str(cm))

# calculating sensitivity of the test set: TP / (TP + FN)
sensitivity_test = cm[1,1]/(cm[1,1]+cm[1,0])
print('The sensitivity of the test set is: ' + str(sensitivity_test))

# calculating specificity of the test set: TN / (TN + FP)
specificity_test = cm[0,0]/(cm[0,0]+cm[0,1])
print('The specificity of the test set is: ' + str(specificity_test))