# Predicting Bad Loans with Machine Learning - Part 3: Modeling

### Content List
- [Imports](#Imports)
- [Read in CSV](#Read-in-CSV)
- [Defining Inputs](#Defining-Inputs)
- [Functions](#Functions)
- [Modeling: Logistic Regression](#Modeling:-Logistic-Regression)
- [Modeling: Random Forest](#Modeling:-Random-Forest)
- [Modeling: XGBoost](#Modeling:-XGBoost)
- [Conclusions](#Conclusions)

### Imports

In [37]:
# Core data-handling and plotting libraries
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import time

%matplotlib inline
%config InlineBackend.figure_format = 'retina'

# Import sklearn elements
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.metrics import accuracy_score, recall_score, precision_score, confusion_matrix, r2_score, f1_score
from sklearn.ensemble import RandomForestClassifier

#import xgboost
from xgboost import XGBClassifier

### Read in CSV

In [3]:
#read in file from csv to dataframe
data_read = pd.read_csv('./cleaned_FEATURES.csv')


In [4]:
data = data_read

In [5]:
num_cols = data.select_dtypes(include=('int64', 'float64')).columns

In [6]:
list(num_cols)

['loan_amnt',
 'int_rate',
 'installment',
 'grade',
 'emp_length',
 'annual_inc',
 'dti',
 'delinq_2yrs',
 'fico_range_low',
 'fico_range_high',
 'inq_last_6mths',
 'open_acc',
 'pub_rec',
 'revol_bal',
 'revol_util',
 'total_acc',
 'out_prncp',
 'out_prncp_inv',
 'recoveries',
 'collection_recovery_fee',
 'last_pymnt_amnt',
 'last_fico_range_high',
 'last_fico_range_low',
 'collections_12_mths_ex_med',
 'acc_now_delinq',
 'tot_coll_amt',
 'tot_cur_bal',
 'acc_open_past_24mths',
 'avg_cur_bal',
 'bc_open_to_buy',
 'bc_util',
 'chargeoff_within_12_mths',
 'delinq_amnt',
 'mort_acc',
 'num_accts_ever_120_pd',
 'num_actv_bc_tl',
 'num_actv_rev_tl',
 'num_bc_sats',
 'num_bc_tl',
 'num_il_tl',
 'num_op_rev_tl',
 'num_rev_accts',
 'num_rev_tl_bal_gt_0',
 'num_sats',
 'num_tl_30dpd',
 'num_tl_90g_dpd_24m',
 'num_tl_op_past_12m',
 'pct_tl_nvr_dlq',
 'percent_bc_gt_75',
 'pub_rec_bankruptcies',
 'tax_liens',
 'tot_hi_cred_lim',
 'total_bal_ex_mort',
 'total_il_high_credit_limit',
 'classes',
 

In [7]:
data.dtypes.value_counts()

int64      85
float64    71
object      6
dtype: int64

## Defining Inputs

In [8]:
X = data[num_cols].drop(columns='classes')
y = data['classes']

In [9]:
X.dtypes.value_counts()

int64      84
float64    71
dtype: int64

In [10]:
# Create training and testing sets.
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size=0.30,
                                                    random_state=420,
                                                    stratify = y)

In [11]:
X_train.shape

(847634, 155)

In [12]:
y_train.shape

(847634,)

## Functions

In [40]:
def metrics(model):
    """Print test-set classification metrics and return the confusion matrix.

    Relies on the notebook-level X_test and y_test defined above.
    Class 1 is a successful loan (the positive class); class 0 is a failure.
    """
    preds = model.predict(X_test)  # generate predictions

    # Rows are true values, columns are predicted values.
    test_conf = confusion_matrix(y_test, preds)
    tn, fp, fn, tp = test_conf.ravel()  # unpack the counts used for specificity

    accuracy = accuracy_score(y_test, preds)
    print(f"Accuracy score: {accuracy * 100.0:.2f}%")

    recall = recall_score(y_test, preds)
    print(f"Recall score: {recall * 100.0:.2f}%")

    precision = precision_score(y_test, preds)
    print(f"Precision score: {precision * 100.0:.2f}%")

    specificity = tn / (tn + fp)  # share of actual failures predicted as failures
    print(f"Specificity score: {specificity * 100.0:.2f}%")

    f1 = f1_score(y_test, preds)
    print(f"F1 score: {f1 * 100.0:.2f}%")

    df_conf = pd.DataFrame(test_conf,
                           index=['Actual Failure', 'Actual Success'],
                           columns=['Predicted Failure', 'Predicted Success'])
    return df_conf

In [16]:
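def cv_score(model):
    """Print mean 3-fold cross-validated accuracy for the train and test sets."""
    # cross_val_score refits a clone of the model on each fold, so the test
    # figure comes from folds within X_test, not from the model fit on X_train.
    cv_train = cross_val_score(model, X_train, y_train, cv=3).mean()
    cv_test = cross_val_score(model, X_test, y_test, cv=3).mean()
    print(f'Mean CV Score for Training: {cv_train}')
    print(f'Mean CV Score for Testing: {cv_test}')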

In [57]:
avg_rate = data['int_rate'].mean()
avg_rate = round((avg_rate),2)
print(f'The Average Interest Rate is: {avg_rate}%')

avg_loan = data['loan_amnt'].mean()
avg_loan = round((avg_loan),2)
print(f'The Average Loan amount is: {avg_loan}$')

The Average Interest Rate is: 13.3%
The Average Loan amount is: 14821.58$


## Modeling: Logistic Regression

**Assumptions of Logistic Regression:**
- Linearity: The independent variables X1, ..., Xm are linearly related to the logit of the probability that Y = 1, that is, to the log-odds that Y = 1.
- Independence of Errors: The observations y1, ..., yn are independent of one another.
- Distribution of Errors: Each observation yi follows a Bernoulli distribution with probability of success pi.
- Independence of Independent Variables: The independent variables X1, ..., Xm are not highly correlated with one another (little or no multicollinearity).
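
The linearity assumption can be written out explicitly. As a sketch in the notation of the list above, with $\beta_0, \dots, \beta_m$ denoting the intercept and coefficients (symbols introduced here for illustration only), the log-odds of $Y = 1$ are modeled as a linear function of the inputs:

$$
\log\frac{p}{1-p} = \beta_0 + \beta_1 X_1 + \cdots + \beta_m X_m,
\qquad
p = P(Y = 1 \mid X_1, \dots, X_m) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X_1 + \cdots + \beta_m X_m)}}
$$

In scikit-learn, the fitted values of $\beta_0$ and $\beta_1, \dots, \beta_m$ are exposed as the `intercept_` and `coef_` attributes of the model.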

In [13]:
# Step 1: Instantiate our model.
logreg = LogisticRegression(solver='liblinear')

# Step 2: Fit our model.
logreg.fit(X_train, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='liblinear',
          tol=0.0001, verbose=0, warm_start=False)

In [14]:
#return the intercept and coefficients
print(f'Logistic Regression Intercept: {logreg.intercept_}')
print(f'Logistic Regression Coefficient: {logreg.coef_}')

Logistic Regression Intercept: [6.32961328e-14]
Logistic Regression Coefficient: [[ 4.07904076e-10  5.96308983e-13  1.38685475e-11  3.97652170e-13
   3.65134606e-13  3.58958541e-09  9.55160985e-13  1.44740001e-14
   4.42965257e-11  4.45497139e-11  2.93043667e-14  6.55296536e-13
   1.43212357e-14  5.74296441e-10  2.94267183e-12  1.40966286e-12
  -8.16589498e-11 -8.16419450e-11 -8.55622428e-11 -1.43699681e-11
   7.19649478e-10  5.33124581e-11  5.63098828e-11  8.04785828e-16
   1.81214189e-16  1.84079718e-11  3.52829790e-09  2.31774505e-13
   3.75436627e-10  5.98756624e-10  3.48069342e-12  4.89812832e-16
  -1.17568426e-13  7.59389274e-14  3.01246166e-14  2.01801095e-13
   3.05684566e-13  2.69587830e-13  4.85434240e-13  4.47955437e-13
   4.80342780e-13  8.71571369e-13  3.05066804e-13  6.52654291e-13
   9.89937753e-17  4.57196640e-15  1.06423871e-13  5.96500407e-12
   2.45818434e-12  9.89397629e-15  2.54569938e-15  4.86727510e-09
   1.63626464e-09  1.40523224e-09  3.53115575e-10  1.21189313

In [17]:
# Accuracy on the training and testing sets
print(f'Logistic Regression train score: {logreg.score(X_train, y_train)}')
print(f'Logistic Regression test score: {logreg.score(X_test, y_test)}')

Logistic Regression train score: 0.7890657996257818
Logistic Regression test score: 0.7891634918188024


In [44]:
cv_score(logreg)

CV Score for Training: 0.7894504068921031
CV Score for Testing: 0.7955884107197421


In [41]:
metrics(logreg)

Accuracy score: 78.92%
Recall score: 99.91%
Precision score: 78.91%
Specificity score: 1.36%
F1 score: 88.18%


| | Predicted Failure | Predicted Success |
|---|---|---|
| Actual Failure | 1055 | 76335 |
| Actual Success | 256 | 285626 |


In [45]:
data['classes'].value_counts(normalize=True)

1    0.786963
0    0.213037
Name: classes, dtype: float64

#### Model Interpretation: Logistic Regression
Logistic Regression is my initial model, but it barely performs better than the class split of 78.70% Success (and 21.30% Failure). With an Accuracy score of 78.92%, it beats the majority-class baseline by only about 0.22 percentage points.

Interestingly, the training accuracy and cross-validation score came in slightly lower than the corresponding testing scores. This indicates that, despite the large number of features (existing and engineered), the model is not overfit: it performed at least as well on "new" data as on the training set, which suggests it should generalize to other sets of data.

However, Accuracy is not necessarily the best metric for my use case, since it is simply the ratio of correctly predicted observations to total observations. The main concern is that bad loans will be classified as good loans, so that investors lose principal to defaults or incomplete repayments. In confusion-matrix terms, I am primarily looking to minimize False Positives. The metrics that capture this are Specificity and Precision, so the 1.36% Specificity (abysmal) and 78.91% Precision (not great) are the baselines against which I will evaluate the next models.

Since the main concern is the number of False Positives, the 76,335 recorded here is the figure I will focus on reducing in subsequent models.
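
To make the link between False Positives and these two metrics concrete, here is a minimal sketch that recomputes them from raw confusion-matrix counts. The helper name `false_positive_metrics` is hypothetical and not part of the notebook; the counts are copied from the Logistic Regression confusion matrix above.

```python
def false_positive_metrics(tn, fp, tp):
    """Return (specificity, precision) from confusion-matrix counts.

    Hypothetical helper for illustration; both metrics fall as the
    number of false positives (fp) rises.
    """
    specificity = tn / (tn + fp)  # share of actual failures predicted as failures
    precision = tp / (tp + fp)    # share of predicted successes that were successes
    return specificity, precision


# Counts copied from the Logistic Regression confusion matrix above.
spec, prec = false_positive_metrics(tn=1055, fp=76335, tp=285626)
print(f"Specificity: {spec:.2%}")  # Specificity: 1.36%
print(f"Precision: {prec:.2%}")    # Precision: 78.91%
```

Because `fp` appears in both denominators, reducing False Positives raises both scores, which is why they are tracked together here.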

## Modeling: Random Forest

In [58]:
rf = RandomForestClassifier(max_depth= 5, max_features= 5, n_estimators= 100)
rf.fit(X_train, y_train)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=5, max_features=5, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=None,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

In [None]:
# start = time.time()
# rf_params = {
#     'n_estimators': [50, 100],
#     'max_depth': [None, 3, 5],
#     'max_features': ['auto', 5,], 
#     'max_leaf_nodes': [None, 5, 10]
# }

# gs= GridSearchCV(rf, param_grid= rf_params)
# gs.fit(X_train, y_train)
# print(f'The Best Score is: {gs.best_score_}')
# print(f'The Best Parameters were: {gs.best_params_}')
# end = time.time()
# duration = round(end - start)
# print(f'It took {duration} seconds to find the optimal parameters')

In [59]:
cv_score(rf)

CV Score for Training: 0.9212962195771981
CV Score for Testing: 0.9200736516780316


In [60]:
print(f'Random Forest train score: {rf.score(X_train, y_train)}')
print(f'Random Forest test score: {rf.score(X_test, y_test)}')

Random Forest train score: 0.9236462907339724
Random Forest test score: 0.9231842806492105


In [61]:
metrics(rf)

Accuracy score: 92.32%
Recall score: 100.00%
Precision score: 91.11%
Specificity score: 63.94%
F1 score: 95.35%


| | Predicted Failure | Predicted Success |
|---|---|---|
| Actual Failure | 49485 | 27905 |
| Actual Success | 0 | 285882 |


In [65]:
feature_importance = pd.Series(data = rf.feature_importances_,
                              index = X.columns)
feature_importance.sort_values(ascending = False)

collection_recovery_fee                     1.796475e-01
last_fico_range_high*last_fico_range_low    1.497566e-01
recoveries*collection_recovery_fee          1.213029e-01
last_fico_range_high                        1.193393e-01
recoveries                                  1.101219e-01
last_fico_range_low                         8.682146e-02
last_pymnt_amnt                             7.606031e-02
out_prncp                                   3.531903e-02
debt_settlement_flag_Y                      2.532392e-02
out_prncp_inv                               2.194563e-02
int_rate                                    1.246862e-02
grade                                       1.040678e-02
term_ 60 months                             9.172313e-03
fico_range_high                             4.292976e-03
funded_amnt*installment                     2.640404e-03
fico_range_low                              2.633309e-03
int_rate*grade                              2.245459e-03
acc_open_past_24mths           

#### Model Interpretation: Random Forest
Across almost all metrics, Random Forest offers massive improvements over Logistic Regression. The one exception is the fit of the model itself: the training score is slightly higher than the testing score, which suggests mild overfitting. The gap is under 0.05 percentage points, however, and Random Forest models tend to overfit, so this is not a large concern.

Interestingly, this model has a perfect Recall score of 100%, which is impressive! Although Recall is not the exact measure we are targeting, it is still relevant to our scenario because it represents potential earnings from interest payments. With no False Negatives (Predicted Failures that turned out to be Successes), this model minimizes the interest income lost on loans that would have been repaid.

Despite that minor success, there is a larger concern in the nearly 28,000 False Positives, the Predicted Successes that turned out to be Failures. These represent a much larger financial loss than lost interest payments, as the principal investment is also at risk. As in the analysis for Logistic Regression, we are looking at a combination of Precision and Specificity, both of which improved markedly (Precision from 78.91% to 91.11%, Specificity from 1.36% to 63.94%). The absolute number of False Positives also dropped sharply, from 76,335 to 27,905, a reduction of about 63%!

## Modeling: XGBoost

In [26]:
xgb = XGBClassifier()

In [27]:
xgb.fit(X_train, y_train)

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bytree=1, gamma=0, learning_rate=0.1, max_delta_step=0,
       max_depth=3, min_child_weight=1, missing=None, n_estimators=100,
       n_jobs=1, nthread=None, objective='binary:logistic', random_state=0,
       reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
       silent=True, subsample=1)

In [None]:
cv_score(xgb)

In [35]:
print(f'XGBoost train score: {xgb.score(X_train, y_train)}')
print(f'XGBoost test score: {xgb.score(X_test, y_test)}')

XGBoost train score: 0.9779362319114145
XGBoost test score: 0.9774906956770685


In [43]:
metrics(xgb)

Accuracy score: 97.75%
Recall score: 98.68%
Precision score: 98.46%
Specificity score: 94.32%
F1 score: 98.57%


| | Predicted Failure | Predicted Success |
|---|---|---|
| Actual Failure | 72992 | 4398 |
| Actual Success | 3779 | 282103 |


#### Model Interpretation: XGBoost
My third model is the runaway winner in nearly all metrics, with the exceptions of Recall and fit. Our primary indicator of success, the minimization of False Positives, shows a substantial improvement over both the Logistic Regression and Random Forest models. XGBoost produces 4,398 False Positives, a reduction of about 94% compared to Logistic Regression (76,335) and about 84% compared to Random Forest (27,905).

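One potential concern is that the number of False Negatives (3,779) is much larger than in both prior models (256 for Logistic Regression and 0 for Random Forest). However, a False Negative costs only the forgone interest, while a False Positive puts the principal at risk, so the financial impact of the additional False Negatives is more than covered by the reduction in False Positives.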
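
The trade-off can be illustrated with a rough back-of-the-envelope comparison. This is only a sketch under two simplifying assumptions that are not part of the original analysis: every False Positive loses the full average principal, and every False Negative forgoes one year of interest at the average rate. The averages are the ones printed in the Functions section, the error counts come from the three confusion matrices, and the function name `estimated_cost` is hypothetical.

```python
AVG_LOAN = 14821.58  # average loan amount printed earlier in this notebook
AVG_RATE = 0.133     # average interest rate (13.3%) printed earlier

# (false positives, false negatives) from each model's confusion matrix
errors = {
    "Logistic Regression": (76335, 256),
    "Random Forest": (27905, 0),
    "XGBoost": (4398, 3779),
}


def estimated_cost(fp, fn, loan=AVG_LOAN, rate=AVG_RATE):
    """Rough misclassification cost under the simplifying assumptions below."""
    lost_principal = fp * loan        # assumption: the full principal is lost
    lost_interest = fn * loan * rate  # assumption: one year of interest is forgone
    return lost_principal + lost_interest


costs = {name: estimated_cost(fp, fn) for name, (fp, fn) in errors.items()}
for name, cost in costs.items():
    print(f"{name}: ${cost:,.0f}")
```

Under these assumptions the principal term dominates, so the model with the fewest False Positives comes out cheapest even after its extra False Negatives are counted. Real losses depend on recoveries, loan terms, and partial repayments, so the figures are illustrative only.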

## Conclusions
Through this process, I have shown that it is possible to make accurate predictions about the quality of a loan, as defined by Success or Failure. Of the models tested, XGBoost is the clear choice, as it improves on every metric except Recall and fit.

For future iterations of this project, I recommend further analysis of the textual elements via Natural Language Processing. Specifically, I would like to incorporate CountVectorizer into this model in order to use the raw text input as Chang et al. did, rather than just measuring the length of the text provided for each of the three features.

Additional feature engineering is also possible with ZIP code data. In this project it was not feasible to bring in additional data sets and join them in order to create mean and median incomes, as Li & Han did.

In these models, I focused exclusively on binary classification. In reality, loans have more than two possible outcomes. To cover the full range of outcomes, it would be ideal to transform this problem from binary to multinomial (multiclass) classification.

The size of the data slowed the optimization process dramatically and drastically reduced my ability to perform GridSearch on the models. To solve this, it would be ideal to create a small development set on which the parameters can be tuned quickly, without having to subject the entire dataset to the process.
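
One way such a development set could be carved out is with a stratified subsample, sketched below. The function name `make_dev_set`, the 5% fraction, and the toy data are assumptions made for illustration; in the notebook the inputs would be `X_train` and `y_train`.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split


def make_dev_set(X, y, frac=0.05, seed=420):
    """Return a stratified random subsample (X_dev, y_dev) of the given fraction.

    Hypothetical helper: frac and seed are illustrative defaults.
    """
    X_dev, _, y_dev, _ = train_test_split(
        X, y, train_size=frac, random_state=seed, stratify=y
    )
    return X_dev, y_dev


# Toy stand-in for X_train / y_train with roughly the notebook's class balance.
rng = np.random.default_rng(0)
X_toy = pd.DataFrame(rng.normal(size=(10_000, 4)), columns=list("abcd"))
y_toy = pd.Series((rng.random(10_000) < 0.787).astype(int), name="classes")

X_dev, y_dev = make_dev_set(X_toy, y_toy, frac=0.05)
print(X_dev.shape)
print(abs(y_dev.mean() - y_toy.mean()))  # class balance is preserved closely
```

Stratifying on `y` keeps the Success/Failure ratio of the subsample close to that of the full set, so a `GridSearchCV` run on `X_dev` and `y_dev` sees the same class imbalance at a fraction of the cost. The best parameters found this way would still need to be confirmed on the full training set.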