# Random Forest & Bagged Decision Tree Optimization Comparison

This notebook tunes hyperparameters for Bagged Decision Tree, Random Forest, and AdaBoost models, then combines them in voting classifiers, to see whether any of them can serve as our main model.

Magnus Bigelow

## Contents

- [Imports](#Imports)
- [Useful Functions](#Useful-Functions)
- [Train-Test Split](#Train-Test-Split)
- [Bagged Decision Tree](#Bagged-Decision-Tree)
- [Random Forest](#Random-Forest)
- [AdaBoost](#AdaBoost)
- [Voting Classifier](#Voting-Classifier)
- [Voting Classifier w/KNN](#Voting-Classifier-w/KNN)

## Imports

In [16]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# General Modeling Imports 
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.metrics import confusion_matrix, accuracy_score, f1_score

# Classification models
from sklearn.ensemble import RandomForestClassifier, BaggingClassifier,  AdaBoostClassifier

In [3]:
train = pd.read_csv('./data/clean_train.csv')

In [4]:
train.head()

Unnamed: 0,age,education-num,sex,capital-gain,capital-loss,hours-per-week,wage,marital_status_num,occupation_com_House_Services,occupation_com_Other,occupation_com_Professional,occupation_com_Specialty,occupation_com_Tech/sales,workclass_com_ Government,workclass_com_ Other,workclass_com_ Private,workclass_com_ Self-employed,cap_gain_binary,cap_loss_binary,gdp_pc
0,39,13,1,2174,0,40,0,0,0,1,0,0,0,1,0,0,0,1,0,41524.09
1,50,13,1,0,0,13,0,1,0,0,1,0,0,0,0,0,1,0,0,41524.09
2,38,9,1,0,0,40,0,0,1,0,0,0,0,0,0,1,0,0,0,41524.09
3,53,7,1,0,0,40,0,1,1,0,0,0,0,0,0,1,0,0,0,41524.09
4,28,13,1,0,0,40,0,1,0,0,1,0,0,0,0,1,0,0,0,12492.097


### Useful Functions

In [5]:
# Function to print train and test F1 score
def f1(model, X_train, y_train, X_test, y_test):
    y_train_p = model.predict(X_train)
    y_test_p = model.predict(X_test)
    f_train = f1_score(y_train,y_train_p)
    f_test = f1_score(y_test,y_test_p)
    print(f'Train F1: {round(f_train,3)}')
    print(f'Test F1: {round(f_test,3)}')

In [23]:
# Function to calculate and display classification metrics; works for a binary (0/1) y
def class_metrics(model, X, y):
    # Generate predictions
    preds = model.predict(X)
    # Get confusion matrix and unravel
    tn, fp, fn, tp = confusion_matrix(y,preds).ravel()
    # Accuracy
    print(f'Accuracy: {round((tp+tn)/len(y),3)}')
    # Sensitivity
    print(f'Sensitivity: {round(tp/(tp+fn),3)}')
    # Specificity
    print(f'Specificity: {round(tn/(tn+fp),3)}')
    # Precision
    print(f'Precision: {round(tp/(tp+fp),3)}')

### Train-Test Split

In [8]:
# Set up X and Y
X = train[['age', 'education-num', 'sex', 
       'hours-per-week', 'marital_status_num',
       'occupation_com_House_Services', 'occupation_com_Professional',
       'occupation_com_Specialty','occupation_com_Tech/sales', 
       'workclass_com_ Government','workclass_com_ Private',
       'workclass_com_ Self-employed', 'cap_gain_binary', 
       'cap_loss_binary','gdp_pc']]
y = train['wage']

# Split the data into the training and testing sets.
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    stratify=y,
                                                    random_state=33)
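
To illustrate what `stratify=y` does in the split above, here is a small sketch on synthetic data. The 24% positive rate and the names `X_demo` and `y_demo` are assumptions for illustration, not values taken from `clean_train.csv`. The point is only that the positive rate in the train and test parts stays close to the rate in the full data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data; the 24% positive rate is an assumption
rng = np.random.default_rng(33)
X_demo = rng.normal(size=(1000, 3))
y_demo = np.array([1] * 240 + [0] * 760)

# Same call pattern as the notebook: default 75/25 split, stratified on y
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          stratify=y_demo,
                                          random_state=33)

# All three positive rates should be nearly identical
print(y_demo.mean(), y_tr.mean(), y_te.mean())
```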

### Bagged Decision Tree

For the bagged decision tree, most of the hyperparameters we could tune would make it behave more like a random forest. We will therefore try different values for n_estimators but otherwise leave the model as is.

In [29]:
# Adapted from GA DSI Lesson 6.03
bt = BaggingClassifier(random_state=33)
bt_params = {
    'n_estimators': [50, 100, 250],
}

bt_gs = GridSearchCV(bt, 
                  param_grid=bt_params,
                  cv=5,
                  n_jobs=2)

bt_gs.fit(X_train,y_train)

# Show metrics and best parameters
print(f'Best hyperparameter: {bt_gs.best_params_}\n')
f1(bt_gs,X_train,y_train,X_test,y_test)

Best hyperparameter: {'n_estimators': 250}

Train F1: 0.91
Test F1: 0.6
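
One thing to keep in mind when reading these results: when no `scoring` argument is passed, `GridSearchCV` ranks candidates by the estimator's default score, which is accuracy for classifiers, while this notebook reports F1. A minimal sketch of selecting by F1 instead, assuming that is the metric we care about. It uses synthetic data from `make_classification` and a deliberately small grid so it runs quickly; the names `X_demo`, `y_demo`, and `gs_f1` are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic, imbalanced stand-in data (not the notebook's dataset)
X_demo, y_demo = make_classification(n_samples=400, n_features=8,
                                     weights=[0.75, 0.25], random_state=33)

# scoring='f1' makes the search rank candidates by F1 rather than accuracy
gs_f1 = GridSearchCV(BaggingClassifier(random_state=33),
                     param_grid={'n_estimators': [5, 10]},
                     scoring='f1',
                     cv=3)
gs_f1.fit(X_demo, y_demo)

print(gs_f1.best_params_, round(gs_f1.best_score_, 3))
```

Whether this would change the selected hyperparameters on the real data is not known without rerunning the search.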


In [33]:
class_metrics(bt_gs,X_test,y_test)

Accuracy: 0.816
Sensitivity: 0.574
Specificity: 0.892
Precision: 0.629


Our bagged decision tree model improved its performance over the baseline model we ran:
- Train F1: 0.888 -> 0.91
- Test F1: 0.581 -> 0.6

The test metrics show that the model classifies negative cases well (specificity 0.892) but misses many positive cases (sensitivity 0.574), which is expected given our imbalanced classes.
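
The F1 score is the harmonic mean of precision and sensitivity (recall), which is why weak sensitivity holds it down even when accuracy looks reasonable. A short check using the test precision and sensitivity reported above; the helper name `f1_from` is introduced here for illustration.

```python
def f1_from(precision, sensitivity):
    """Harmonic mean of precision and sensitivity (recall)."""
    return 2 * precision * sensitivity / (precision + sensitivity)

# Test-set values printed by class_metrics for the bagged trees
bagged_f1 = f1_from(precision=0.629, sensitivity=0.574)
print(round(bagged_f1, 3))
```

The result matches the test F1 reported by the `f1` helper, up to rounding of the inputs.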

This gain comes at the cost of fitting 250 trees instead of the default 10. Because 25 times as many trees bought only a small improvement, adding even more trees is unlikely to help much. Let's look at the Random Forest model.

### Random Forest

For the random forest we will try a slightly higher number of estimators and vary three other settings: max_depth (the maximum depth of each tree, with None letting the trees grow without a depth limit), warm_start (reuses the trees from a previous call to fit and adds more, instead of fitting a new forest; because each grid-search candidate is fit from scratch, it should not change the scores here), and min_samples_leaf (the minimum number of samples required at each leaf node).
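
To make two of these settings concrete, here is a minimal sketch on synthetic data (`X_demo`, `y_demo`, and `rf_demo` are illustrative names). It shows that `warm_start=True` only matters when `fit` is called again with a larger `n_estimators`, in which case the existing trees are kept and new ones are added, and that no tree grows deeper than `max_depth`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (not the notebook's dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=8,
                                     random_state=33)

rf_demo = RandomForestClassifier(n_estimators=10, max_depth=3,
                                 min_samples_leaf=5, warm_start=True,
                                 random_state=33)
rf_demo.fit(X_demo, y_demo)
n_first = len(rf_demo.estimators_)

# With warm_start=True, raising n_estimators and refitting keeps the
# 10 existing trees and trains only the 10 additional ones
rf_demo.set_params(n_estimators=20)
rf_demo.fit(X_demo, y_demo)
n_second = len(rf_demo.estimators_)

# max_depth caps the depth of every individual tree
deepest = max(tree.get_depth() for tree in rf_demo.estimators_)

print(n_first, n_second, deepest)
```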

In [15]:
# Adapted from GA DSI Lesson 6.03
rf = RandomForestClassifier(random_state=33)
rf_params = {
    'n_estimators': [100, 125],
    'max_depth': [None, 3, 7, 11],
    'warm_start': [True,False],
    'min_samples_leaf': [1, 5]
    
}

rf_gs = GridSearchCV(rf, 
                  param_grid=rf_params,
                  cv=5,
                  n_jobs=2)

rf_gs.fit(X_train,y_train)

# Show metrics and best parameters
print(f'Best hyperparameter: {rf_gs.best_params_}\n')
f1(rf_gs,X_train,y_train,X_test,y_test)

Best hyperparameter: {'max_depth': 11, 'min_samples_leaf': 5, 'n_estimators': 100, 'warm_start': True}

Train F1: 0.657
Test F1: 0.635


In [25]:
class_metrics(rf_gs,X_test,y_test)

Accuracy: 0.847
Sensitivity: 0.552
Specificity: 0.941
Precision: 0.748


The best random forest model used a max_depth of 11, a min_samples_leaf of 5, 100 estimators, and warm_start=True.

This gives a test accuracy of 0.847 and a test F1 of 0.635, an improvement on the F1 score of the untuned default model and on the bagged decision tree.

Although the train F1 score went down, this is a good sign: the model is no longer severely overfit, so we can place more trust in its performance on new data.

### AdaBoost

For the AdaBoost model we will try the default 50 estimators and also 100 estimators. Additionally, we will vary the learning rate, which scales the weight given to each successive estimator's contribution, to try to improve accuracy.
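
Before running a full grid search, the effect of the learning rate can be previewed with `cross_val_score`. The sketch below assumes synthetic data from `make_classification` and F1 as the scoring metric; the names `X_demo`, `y_demo`, and `cv_f1` are illustrative, and the scores it prints say nothing about the real dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score

# Synthetic, imbalanced stand-in data (not the notebook's dataset)
X_demo, y_demo = make_classification(n_samples=400, n_features=8,
                                     weights=[0.75, 0.25], random_state=33)

# Mean cross-validated F1 for each learning rate in the notebook's grid
cv_f1 = {}
for lr in [0.5, 1.0, 1.5]:
    model = AdaBoostClassifier(n_estimators=50, learning_rate=lr,
                               random_state=33)
    scores = cross_val_score(model, X_demo, y_demo, cv=3, scoring='f1')
    cv_f1[lr] = scores.mean()

for lr, score in cv_f1.items():
    print(lr, round(score, 3))
```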

In [18]:
# Adapted from GA DSI Lesson 6.03
ab = AdaBoostClassifier(random_state=33)
ab_params = {
    'n_estimators': [50, 100],
    'learning_rate': [1, .5, 1.5]
}

ab_gs = GridSearchCV(ab, 
                  param_grid=ab_params,
                  cv=5,
                  n_jobs=2)

ab_gs.fit(X_train,y_train)

# Show metrics and best parameters
print(f'Best hyperparameter: {ab_gs.best_params_}\n')
f1(ab_gs,X_train,y_train,X_test,y_test)

Best hyperparameter: {'learning_rate': 1, 'n_estimators': 50}

Train F1: 0.641
Test F1: 0.646


In [26]:
class_metrics(ab_gs,X_test,y_test)

Accuracy: 0.847
Sensitivity: 0.581
Specificity: 0.931
Precision: 0.727


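Our best AdaBoost model turned out to be the default configuration, and it is our best model so far. Its test accuracy is 0.847 and its test F1 score is 0.646, our highest yet.

The F1 is still low because our classes are imbalanced and the model misses many positive cases (test sensitivity 0.581).

Let's try a voting classifier that combines the three model types fit above, using the best hyperparameters found for the random forest and AdaBoost. Note that the bagged trees in the cell below are left at their default n_estimators rather than the tuned value of 250.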

### Voting Classifier

In [19]:
from sklearn.ensemble import VotingClassifier

In [31]:
vote = VotingClassifier([
            ('bt',BaggingClassifier(random_state=33)),
            ('rf',RandomForestClassifier(random_state=33, max_depth=11, min_samples_leaf=5)),
            ('ab',AdaBoostClassifier(random_state=33, n_estimators=50)) 
])

# Fit 
vote.fit(X_train,y_train)

# Show metrics 
f1(vote,X_train,y_train,X_test,y_test)

Train F1: 0.682
Test F1: 0.639


In [32]:
class_metrics(vote,X_test,y_test)

Accuracy: 0.847
Sensitivity: 0.562
Specificity: 0.938
Precision: 0.741


Our voting classifier performs slightly better than our AdaBoost model on the training set (F1 0.682 vs. 0.641) but slightly worse on the testing set (F1 0.639 vs. 0.646). This may be a result of noise in the training and testing sets, but for now AdaBoost is still our best model. Perhaps adding a logistic regression and a KNN model to the voting classifier would result in a superior score.
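
For reference, `VotingClassifier` uses hard (majority) voting unless `voting='soft'` is passed, so each sample gets the class that most of the member models predict. A minimal NumPy sketch of that rule, using made-up predictions from three hypothetical models:

```python
import numpy as np

# Made-up 0/1 predictions from three hypothetical models for five samples
preds = np.array([
    [1, 0, 1, 0, 0],   # model A
    [1, 1, 0, 0, 0],   # model B
    [0, 1, 1, 0, 1],   # model C
])

# Hard voting: predict 1 when more than half of the models say 1
majority = (preds.sum(axis=0) > preds.shape[0] / 2).astype(int)
print(majority)
```

With three binary models there are no ties, which is one reason an odd number of members is convenient.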

### Voting Classifier w/KNN

During my partner's analysis, he found a KNN model with n_neighbors = 25 that had a higher sensitivity than any of the models in this notebook and similar F1 and accuracy scores to our best models. As a result, we will also try replacing the Bagged Trees model in our voting classifier (as it was the worst-performing model) with a KNN classifier with n_neighbors = 25 to see if the results improve.

In [34]:
from sklearn.neighbors import KNeighborsClassifier

In [35]:
vote = VotingClassifier([
            ('knn',KNeighborsClassifier(n_neighbors=25)),
            ('rf',RandomForestClassifier(random_state=33, max_depth=11, min_samples_leaf=5)),
            ('ab',AdaBoostClassifier(random_state=33, n_estimators=50)) 
])

# Fit 
vote.fit(X_train,y_train)

# Show metrics 
f1(vote,X_train,y_train,X_test,y_test)

Train F1: 0.644
Test F1: 0.635


In [36]:
class_metrics(vote,X_test,y_test)

Accuracy: 0.847
Sensitivity: 0.551
Specificity: 0.941
Precision: 0.749


The voting classifier with the KNN model is slightly worse than the previous voting classifier (test F1 0.635 vs. 0.639), so we will go forward with the standalone KNN model as our primary model rather than any of the models in this notebook. While the KNN model isn't clearly superior to the others, its higher sensitivity is desirable when classes are imbalanced, as ours are.
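
One caveat worth checking: KNN is distance-based, so a large-scale column such as gdp_pc can dominate the binary and small-range features when they are not scaled. Whether the KNN model referenced above scaled its features is not shown in this notebook. The sketch below assumes we wanted to scale only the KNN member of the ensemble, using the `Pipeline` and `StandardScaler` classes already imported at the top. It runs on synthetic data, and `X_demo`, `y_demo`, `scaled_knn`, and `vote_demo` are illustrative names.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (AdaBoostClassifier, RandomForestClassifier,
                              VotingClassifier)
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; one column is inflated to mimic a large-scale feature
X_demo, y_demo = make_classification(n_samples=400, n_features=6,
                                     weights=[0.75, 0.25], random_state=33)
X_demo[:, 0] *= 10000

# Scale features only for KNN; the tree-based members do not need scaling
scaled_knn = Pipeline([
    ('scale', StandardScaler()),
    ('knn', KNeighborsClassifier(n_neighbors=25)),
])

vote_demo = VotingClassifier([
    ('knn', scaled_knn),
    ('rf', RandomForestClassifier(random_state=33, max_depth=11,
                                  min_samples_leaf=5)),
    ('ab', AdaBoostClassifier(random_state=33, n_estimators=50)),
])
vote_demo.fit(X_demo, y_demo)
preds_demo = vote_demo.predict(X_demo)
print(preds_demo[:10])
```

Putting the scaler inside the pipeline means it is refit on each training fold during cross-validation, which avoids leaking test-set statistics into the model.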