## Modeling: Small business loans

I will now build a model to classify small business loans according to whether they will be paid off or default. This is a classification problem, and many of the variables have skewed distributions. Decision tree algorithms often perform well in such cases. I will try two decision-tree based algorithms: random forest, and gradient boosting. I will also tune hyperparameters for both algorithms.

### Loading the data

In [None]:
## Import needed Python module and functions 
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_validate
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score

In [None]:
## Import features for training values
features = pd.read_csv('./Data/Processed/X_train.csv')

In [None]:
## Import labels for training data
labels = pd.read_csv('./Data/Processed/y_train.csv')

In [None]:
## Convert training data into numpy arrays to train models
feature_names = features.columns
X = features.values
y = labels.values
y = y.ravel()

### Random forest model

#### Baseline model

To get a benchmark for hyperparameter tuning, let's first see how the Random Forest algorithm performs just using the `sklearn` defaults.

In [7]:
rf_model = RandomForestClassifier(random_state=42)

In [8]:
cv_results_rf = cross_validate(rf_model, X, y, scoring=('f1', 'accuracy'), cv=3, n_jobs=1)

In [9]:
## Print out cross-validated metrics for the 'out of the box' model.
pd.DataFrame(cv_results_rf).mean()

fit_time         150.317626
score_time         8.107573
test_f1            0.780226
test_accuracy      0.929110
dtype: float64

#### Hyperparameter tuning

Now, let's see if I can do better with hyperparameter tuning. Since the data is unbalanced, I will use f1 score as the metric to optimize, rather than accuracy. While there are many hyperparameters I could tune, I will focus on `n_estimators`, `criterion` and `max_depth.`

In [10]:
## Set up grid with possible combinations of hyperparameters to search
params = {'n_estimators': [50, 100, 200],
          'criterion' : ['gini', 'entropy'],
          'max_depth': [50, 100, 200, None]
}

rf_for_search = RandomForestClassifier(random_state=42)

In [None]:
## Execute grid search to find the best hyperparameters
search_results_rf = GridSearchCV(estimator=rf_for_search, param_grid=params, 
                                 cv=3, scoring=('accuracy', 'f1'), 
                                 refit='f1', n_jobs=1, verbose=5)
search_results_rf.fit(X, y)

Fitting 3 folds for each of 24 candidates, totalling 72 fits
[CV 1/3] END criterion=gini, max_depth=50, n_estimators=50; accuracy: (test=0.928) f1: (test=0.777) total time= 1.3min
[CV 2/3] END criterion=gini, max_depth=50, n_estimators=50; accuracy: (test=0.929) f1: (test=0.778) total time= 1.4min
[CV 3/3] END criterion=gini, max_depth=50, n_estimators=50; accuracy: (test=0.929) f1: (test=0.780) total time= 1.3min
[CV 1/3] END criterion=gini, max_depth=50, n_estimators=100; accuracy: (test=0.928) f1: (test=0.779) total time= 2.6min
[CV 2/3] END criterion=gini, max_depth=50, n_estimators=100; accuracy: (test=0.929) f1: (test=0.781) total time= 2.7min
[CV 3/3] END criterion=gini, max_depth=50, n_estimators=100; accuracy: (test=0.929) f1: (test=0.781) total time= 2.7min
[CV 1/3] END criterion=gini, max_depth=50, n_estimators=200; accuracy: (test=0.929) f1: (test=0.781) total time= 5.3min
[CV 2/3] END criterion=gini, max_depth=50, n_estimators=200; accuracy: (test=0.930) f1: (test=0.782) t

In [None]:
scores_rf = pd.DataFrame(search_results_rf.cv_results_)[['param_criterion', 
                                                         'param_n_estimators', 
                                                         'param_max_depth', 
                                                         'mean_test_accuracy', 
                                                         'mean_test_f1', 
                                                         'mean_fit_time']]
scores_rf.sort_values(by='mean_test_f1', ascending=False).head(5)

In [None]:
## Print out the best hyperparameters found
search_results_rf.best_params_

In [None]:
best_rf = search_results_rf.best_estimator_

Now that I have chosen the best hyperparameters for the random forest model, I fit the model on the entire training set, and evaluate performance on the validation data.

In [None]:
%%time
## Fit our optimized random forest model to the training data
best_rf.fit(X, y)

In [None]:
X_val_df = pd.read_csv('./Data/Processed/X_test.csv')
y_val_df = pd.read_csv('./Data/Processed/y_test.csv')

In [None]:
## Load the validation data, and save features as an array and labels as a 
X_val = X_val_df.values
y_val = y_val_df.values
y_val = y_val.ravel()

In [None]:
%%time
## Use our model to make a prediction on the validation set
y_val_pred_rf = best_rf.predict(X_val)

In [None]:
## Print metrics for this model.
print("Accuracy: {:.2f}".format(accuracy_score(y_val_pred_rf, y_val)))
print("Precision: {:.2f}".format(precision_score(y_val_pred_rf, y_val)))
print("Recall: {:.2f}".format(recall_score(y_val_pred_rf, y_val)))
print("f1 score: {:.2f}".format(f1_score(y_val_pred_rf, y_val)))

Finally, let's examine the most important features of our model.

In [None]:
## Graph shows feature importances for this model
plt.subplots(figsize=(10, 5))
importances = best_rf.feature_importances_
labeled_importances = pd.Series(importances, index=feature_names).sort_values(ascending=False)[:10]
labeled_importances.plot(kind='bar')
plt.xlabel('features')
plt.ylabel('importance')
plt.title('Best random forest model feature importances');

#### Conclusion: random forest

Random forest offers decent performance even with the default hyperparameters from `sklearn`. However, my efforts to improve performance with hyperparameter turning were not very successful. The optimized model performed only about one percentage point better on the target metric, but took more than three times longer to train. It's possible that the algorithm would do better with an even larger value of `n_estimators`, but this would mean longer training times. For now, I will try a different algorithm. 

### Gradient Boosting Algorithm

#### Evaluating the default model

I will now follow a similar process with gradient boosting. To being, I will evaluate the default version of the model.

In [None]:
gb_model = GradientBoostingClassifier(random_state=42)

In [None]:
cv_results_gb = cross_validate(gb_model, X, y, scoring=('f1', 'accuracy'), cv=3, n_jobs=1)

In [None]:
# Print cross-validated metrics for the 'out of the box' gradient boosting model
pd.DataFrame(cv_results_gb).mean()

It seems that gradient boosting gives similar performance to random forest with the default setting. Lets see if gradient boosting is more responsive to hyperparameter tuning.

#### Hyperparameter tuning

One again I conduct a grid search. This time instead of `criterion`, I vary the parameter `learning_rate`, which determines how quickly the boosting algorithm learns from its mistakes at each iteration. Gradient boosting tends to perform well with a large number of relatively shallow trees, so I will try small values for `max_depth`.

In [None]:
# Set up parameter grid and declare a new estimator
params_gb = {'n_estimators': [100, 200],
            'max_depth': [7, 11, 15],
            'learning_rate': [0.01, 0.1, 1]}

gb_for_search = GradientBoostingClassifier(random_state=42)

In [None]:
# Execute the grid search
search_results_gb = GridSearchCV(estimator=gb_for_search, param_grid=params_gb, cv=3, 
                                 scoring=('accuracy', 'f1'), refit='f1', 
                                 n_jobs=1, verbose=5)
search_results_gb.fit(X, y)

In [None]:
## Print out metrics for the top 5 models
scores_gb = pd.DataFrame(search_results_gb.cv_results_)[['param_learning_rate', 
                                                         'param_max_depth', 
                                                         'param_n_estimators', 
                                                         'mean_test_accuracy', 
                                                         'mean_test_f1', 
                                                         'mean_fit_time']]
scores_gb.sort_values(by='mean_test_f1', ascending=False).head(5)

Here I noticed something interesting. The second-best model performs almost as well as the best model--both accuracy and precision agree to the second decimal place--but requires only about half the training time. Since gradient boosting models generally need much more training time than random forest, this is an important consideration. Using a model that is easier to train means I could more easily improve the model by training on more recent data if it became available. So, I will select the *second* model in the table above as the optimized gradient boosting classifier.

I now train a model with the chosen hyperparameters from the search, and evaluate its performance on the test set. 

In [None]:
## Choose the estimator with the best f1 score
best_gb = GradientBoostingClassifier(n_estimators=100, 
                                     learning_rate=0.1, max_depth=11, random_state=42)

In [None]:
%%time
## Fit the optimized gradient boosting model to all of the training data
best_gb.fit(X, y)

In [None]:
%%time
## Now, use the optimized gradient boosting model to make predictions for the test set
y_val_pred_gb = best_gb.predict(X_val)

In [None]:
## Print evaluation metrics for this model.
print("Accuracy: {:.2f}".format(accuracy_score(y_val_pred_gb, y_val)))
print("Precision: {:.2f}".format(precision_score(y_val_pred_gb, y_val)))
print("Recall: {:.2f}".format(recall_score(y_val_pred_gb, y_val)))
print("f1 score: {:.2f}".format(f1_score(y_val_pred_gb, y_val)))

Finally, I plot feature importances for the gradient boosting model.

In [None]:
import matplotlib.pyplot as plt

plt.subplots(figsize=(10, 5))
importances = best_gb.feature_importances_
labeled_importances = pd.Series(importances, index=feature_names).sort_values(ascending=False)[:10]
labeled_importances.plot(kind='bar')
plt.xlabel('features')
plt.ylabel('importance')
plt.title('Best gradient boosting model feature importances');

#### Conclusion: gradient boosting

The gradient boosting model performed similarly to random forest without hyperparameter turning. After tuning hyperparameters, I found that gradient boosting was significantly more powerful. I was able to acheive an f1 score of 0.85, a full 6 percentage-points higher than my best random forest model. The main downside was a noticeably longer training time.

### Choosing the best model

The best model is gradient boosting, with the `max_depth` parameter set to 11. (All other parameter values are defaults). This model outperformed my best random forest model on four metrics: accuracy, precision, recall, and f1-score. The most notable difference was in precision, which was 0.83 for gradient boosting verus 0.73 for random forest. Hence the gradient boosting model had fewer false positives; it is much less likely to falsely predict that a loan would go into default. 

The major disadvantage of gradient boosting is a longer fit time. My best gradient boosting model had a mean fit time of 750 seconds when doing three-fold cross-validation on the training data. For random forest, the mean fit time was about 200 seconds. While this is a noticable difference, the training time for gradient boosting is not prohibitive. In addition, the gradient boosting model had much faster prediction times (approximately 2 seconds to predict the test set, compared to 22 seconds for random forest). With its faster, more precise predictions, gradient boosting is worth the extra training time.

As a final remark, it is reassuring to note that random forest and gradient boosting had the same top-five features, although gradient boosting put even more emphasis on the `term` feature than random forest. The fact that the models agreed on which features are import suggests that both are finding genuine relationships in the data.