## Random Forest

#### Table of Contents

- [Preliminaries](#Preliminaries)
- [Null Model](#Null-Model)
- [Decision Tree](#Decision-Tree)
- [Bagging](#Bagging)
- [Random Forest](#Random-Forest)
- [Comparison](#Comparison)

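First, let's define two helper metrics that we will reuse later: classification accuracy (`acc`) and root mean squared error (`rmse`). They are saved in `metrics.py` so we can load them with `%run`.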

```
import numpy as np

def acc(yhat, y):
    # share of predictions that match the true labels
    return np.mean(yhat == y)

def rmse(yhat, y):
    # root mean squared error
    return np.sqrt(np.mean((yhat - y)**2))
```

In [None]:
%run metrics.py

In [None]:
%whos
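
To see what the two helpers return, here is a small sketch on made-up arrays. The names `y_true`, `y_pred`, `acc_toy`, and `rmse_toy` are hypothetical and are not part of the class data.

```python
import numpy as np

def acc(yhat, y):
    # share of predictions that match the true labels
    return np.mean(yhat == y)

def rmse(yhat, y):
    # root mean squared error
    return np.sqrt(np.mean((yhat - y) ** 2))

# toy data, for illustration only
y_true = np.array([1, 0, 1, 1])
y_pred = np.array([1, 0, 0, 1])

acc_toy = acc(y_pred, y_true)    # 3 of the 4 predictions match
rmse_toy = rmse(y_pred, y_true)  # one squared error of 1 over 4 rows
print(acc_toy, rmse_toy)
```

Note that `acc` also works when `yhat` is a single label, since numpy and pandas broadcast the comparison.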

***********
# Preliminaries
[TOP](#Random-Forest)

We will compare three models that predict the label `urate_bin`:

1. decision tree
2. bagged decision trees
3. random forest

Loading the packages and prepping the data.

In [None]:
# utilities
import pandas as pd

# processing
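# note: plot_confusion_matrix was removed in scikit-learn 1.2;
# newer versions provide ConfusionMatrixDisplay.from_estimator instead
from sklearn.metrics import plot_confusion_matrix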
from sklearn.model_selection import GridSearchCV, train_test_split

# algorithms
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, BaggingClassifier

# plotting
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
df = pd.read_pickle('C:/Users/johnj/Documents/Data/aml in econ 02 spring 2021/class data/class_data.pkl')
# A note about why we are not converting year

In [None]:
y = df['urate_bin'].astype('category')
x = df.drop(columns = 'urate_bin')

x_train, x_test, y_train, y_test = train_test_split(x, y,
                                                   train_size = 2/3,
                                                   random_state = 490)

****
# Null Model
[TOP](#Random-Forest)

In [None]:
yhat_null = y_train.value_counts().index[0]
acc_null = acc(yhat_null, y_test)
acc_null
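
The null model simply predicts the most common training label for every test row, which gives us a floor that any real model should beat. A minimal sketch on made-up labels (the class names and the `_toy` variables are assumptions for illustration, not the actual `urate_bin` values):

```python
import numpy as np
import pandas as pd

# hypothetical labels standing in for urate_bin
y_train_toy = pd.Series(['similar', 'similar', 'lower',
                         'higher', 'similar']).astype('category')
y_test_toy = pd.Series(['similar', 'lower',
                        'similar', 'higher']).astype('category')

# value_counts() sorts by frequency, so the first index entry is the majority class
majority = y_train_toy.value_counts().index[0]

# accuracy of always predicting the majority class
acc_null_toy = np.mean(y_test_toy == majority)
print(majority, acc_null_toy)
```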

****************
# Decision Tree
[TOP](#Random-Forest)

To compare the tree-based models, we are going to start with a single decision tree classifier.

In [None]:
param_grid = {
    'max_leaf_nodes': range(2, 40)  # sklearn requires max_leaf_nodes >= 2
}

dtc_cv = DecisionTreeClassifier(random_state = 490)

grid_search = GridSearchCV(dtc_cv, param_grid,
                          cv = 5, 
                          scoring = 'accuracy',
                          n_jobs = 10, 
                          verbose = 2)
grid_search.fit(x_train, y_train)
best_dtc = grid_search.best_params_
best_dtc
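
With the default `refit = True`, `GridSearchCV` already refits the best configuration on the full training set and stores it in `best_estimator_`, so refitting by hand is optional. A small sketch on synthetic data (the data, grid size, and variable names are assumptions for illustration) checking that both routes give the same predictions:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# synthetic stand-in for the class data
X, y = make_classification(n_samples=300, n_features=8,
                           n_informative=4, random_state=490)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=2/3,
                                          random_state=490)

grid = GridSearchCV(DecisionTreeClassifier(random_state=490),
                    {'max_leaf_nodes': range(2, 20)},
                    cv=5, scoring='accuracy')
grid.fit(X_tr, y_tr)

# refit by hand with the winning parameters
manual = DecisionTreeClassifier(random_state=490, **grid.best_params_)
manual.fit(X_tr, y_tr)

# the refit estimator stored by GridSearchCV predicts identically
same = (grid.best_estimator_.predict(X_te) == manual.predict(X_te)).all()
print(grid.best_params_, same)
```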

In [None]:
fit_dtc = DecisionTreeClassifier(random_state = 490,
                                 max_leaf_nodes = best_dtc['max_leaf_nodes'])
fit_dtc.fit(x_train, y_train)
acc_dtc = fit_dtc.score(x_test, y_test)
acc_dtc

**********
# Bagging
[TOP](#Random-Forest)

Remember that bagged trees consider ALL features at every split. The only randomness comes from the bootstrap sample each tree is trained on.

In [None]:
fit_bc = BaggingClassifier(n_estimators = 500,
                          random_state = 490,
                          oob_score = True,
                          n_jobs = 10,
                          verbose = 1)
fit_bc.fit(x_train, y_train)

In [None]:
fit_bc.oob_score_
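
Each tree is trained on a bootstrap sample, so some rows are never seen by that tree, and `oob_score_` is the accuracy measured on those left-out rows. A numpy-only sketch (toy sizes, not the class data) of how large the out-of-bag share is for one bootstrap sample:

```python
import numpy as np

rng = np.random.default_rng(490)
n = 10_000

# draw n row indices with replacement, as one bootstrap sample does
boot_idx = rng.integers(0, n, size=n)

# mark which rows made it into the sample at least once
in_bag = np.zeros(n, dtype=bool)
in_bag[boot_idx] = True

# rows never drawn are "out of bag" for this tree
oob_share = 1 - in_bag.mean()
print(oob_share, np.exp(-1))  # the share approaches 1/e for large n
```

So roughly a third of the training rows act as a built-in validation set for each tree, which is why the OOB score can be read as a test-accuracy estimate without touching `x_test`.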

In [None]:
fit_bc.score(x_test, y_test)

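Alternatively, a random forest with `max_features = None` also considers every feature at each split, so it is a bagged-tree model too.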

In [None]:
fit_bag_tree = RandomForestClassifier(n_estimators = 500, 
                                  max_features = None,
                                  oob_score = True,
                                  n_jobs = 10,
                                  random_state = 490,
                                  verbose = 1)
fit_bag_tree.fit(x_train, y_train)

In [None]:
fit_bag_tree.oob_score_

In [None]:
acc_bt = fit_bag_tree.score(x_test, y_test)
acc_bt

*********
# Random Forest
[TOP](#Random-Forest)


Let's see if we can beat the bagged model!

In [None]:
fit_rf = RandomForestClassifier(n_estimators = 500, 
                                  max_features = 'sqrt',
                                  oob_score = True,
                                  n_jobs = 10,
                                  random_state = 490,
                                  verbose = 1)
fit_rf.fit(x_train, y_train)

In [None]:
fit_rf.oob_score_

In [None]:
acc_fit_rf = fit_rf.score(x_test, y_test)
acc_fit_rf
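
The only difference between the bagged model and the random forest is how many features each split may look at. A sketch on synthetic data with 16 features (the data and the names `bagged`, `forest`, `bagged_k`, `forest_k` are assumptions for illustration) that reads the resolved value off the first fitted tree:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# synthetic data with 16 features, so sqrt(16) = 4
X, y = make_classification(n_samples=200, n_features=16,
                           n_informative=5, random_state=490)

bagged = RandomForestClassifier(n_estimators=10, max_features=None,
                                random_state=490).fit(X, y)
forest = RandomForestClassifier(n_estimators=10, max_features='sqrt',
                                random_state=490).fit(X, y)

# each fitted tree stores the number of features it samples per split
bagged_k = bagged.estimators_[0].max_features_
forest_k = forest.estimators_[0].max_features_
print(bagged_k, forest_k)
```

Sampling fewer features per split decorrelates the trees, which is the reason a random forest can beat plain bagging.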

In [None]:
df_plot = pd.DataFrame(fit_rf.feature_importances_,
            index = x_train.columns,
            columns = ['Feature Importance']).sort_values(by = 'Feature Importance',
                                                         ascending = False)

In [None]:
sns.barplot(data = df_plot,
           x = 'Feature Importance',
           y = df_plot.index,
           color = 'darkorange')
plt.show()
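
The values in `feature_importances_` are impurity-based and normalized, so they sum to one and are best read as relative shares. A quick check on synthetic data (the data and column names `x0` to `x5` are made up for illustration):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# synthetic stand-in for the class data
X, y = make_classification(n_samples=300, n_features=6, n_informative=3,
                           n_redundant=0, random_state=490)
cols = [f'x{i}' for i in range(6)]

rf = RandomForestClassifier(n_estimators=50, random_state=490).fit(X, y)

# same idea as df_plot: importances indexed by feature name, largest first
imp = pd.Series(rf.feature_importances_, index=cols).sort_values(ascending=False)
total = imp.sum()
print(imp)
print(total)
```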

************
# Comparison
[TOP](#Random-Forest)

In [None]:
sk_fig = plot_confusion_matrix(fit_dtc, x_test, y_test)
plt.title('Decision Tree')

plt.show()

In [None]:
sk_fig = plot_confusion_matrix(fit_bag_tree, x_test, y_test)
plt.title('Bagged Tree')

plt.show()
# Better at predicting 'similar' when the true class is 'similar'
# Worse: predicts 'similar' more often when the true class is 'lower' or 'higher'

In [None]:
(418+401)/(418+401+1008)

In [None]:
sk_fig = plot_confusion_matrix(fit_rf, x_test, y_test)
plt.title('Random Forest')

plt.show()
# Better at predicting 'similar' when the true class is 'similar'
# Worse: predicts 'similar' more often when the true class is 'lower' or 'higher'

In [None]:
(293+292)/(293+292+838)
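
The ratios in this section are typed in by hand from the plots. A sketch of computing that kind of share straight from a confusion matrix, assuming the quantity of interest is the share of 'similar' predictions whose true class is something else (the labels and toy predictions below are assumptions, not the class data):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

labels = ['lower', 'similar', 'higher']

# toy labels, for illustration only
y_true = np.array(['lower', 'similar', 'similar', 'higher', 'similar', 'lower'])
y_pred = np.array(['similar', 'similar', 'similar', 'similar', 'lower', 'lower'])

# rows are the true classes, columns are the predicted classes
cm = confusion_matrix(y_true, y_pred, labels=labels)

j = labels.index('similar')
pred_similar = cm[:, j].sum()   # every row predicted as 'similar'
wrong_share = (pred_similar - cm[j, j]) / pred_similar
print(cm)
print(wrong_share)
```

Computing the share this way avoids retyping the counts whenever the model or the split changes.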

************
# `sklearn` is pretty cool 
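Every `sklearn` estimator shares the same `fit`/`score` interface, so we can put the models in a list and loop over them. Check this out: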

In [None]:
classifiers = [DecisionTreeClassifier(max_leaf_nodes = best_dtc['max_leaf_nodes'],
                                      random_state = 490),  # same seed as fit_dtc
              BaggingClassifier(n_estimators = 500,
                          random_state = 490,
                          n_jobs = 10),
              RandomForestClassifier(n_estimators = 500, 
                                  max_features = 'sqrt',
                                  n_jobs = 10,
                                  random_state = 490)]

In [None]:
%%time
for clf in classifiers:
    clf.fit(x_train, y_train)

In [None]:
fig, ax = plt.subplots(nrows = 1, ncols = 3)
plt.close() # don't show

In [None]:
type(ax)
type(ax).__name__
ax.shape
ax[0]
ax.flatten() # not necessary in this case
ax == ax.flatten()

In [None]:
fig, ax = plt.subplots(nrows = 1, ncols = 3,
                       figsize = (16, 4))

for clf, axis in zip(classifiers, ax.flatten()):
    plot_confusion_matrix(clf,
                         x_test,
                         y_test,
                         ax = axis)
    axis.set_title(type(clf).__name__)

plt.tight_layout()