### <span style="color:black"><u>**Random Forest Regression**</u><a name="RandomForest"></a></span>

**Introduction**
* Because decision trees are typically high-variance estimators, a single tree is generally not the most effective model to choose
* What if we could build multiple trees, each from randomly selected observations (sampled with replacement) and a random subset of features (at each step of building the tree), and then average the predictions from all of the trees to get our prediction instead? That is the essence of a random forest.
* [Random Forests](https://gdcoder.com/random-forest-regressor-explained-in-depth/) are a machine learning algorithm that can be used for both classification and regression tasks
* A random forest is an ensemble of individual decision trees, each created using [bootstrapping](https://www.quantstart.com/articles/bootstrap-aggregation-random-forests-and-boosted-trees/) (sampling with replacement)
* Because many trees are built and averaged in a random forest regression model, it is far less likely to overfit than the single decision tree I built before
* Other forms of ensemble learning include boosting (e.g. Gradient Boosting, XGBoost etc.) and stacking.
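
To make the variance argument concrete, here is a minimal sketch (not part of the original analysis) that compares a single fully grown tree with a forest using cross-validation. The synthetic `make_friedman1` dataset, the sample size, the noise level and the variable names are all assumptions chosen purely for illustration.

```python
import numpy as np
from sklearn.datasets import make_friedman1
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic, noisy regression problem (illustrative assumption,
# not the life expectancy data used in this notebook)
X_demo, y_demo = make_friedman1(
    n_samples=500, n_features=10, noise=1.0, random_state=0
)

single_tree = DecisionTreeRegressor(random_state=0)
forest = RandomForestRegressor(n_estimators=100, random_state=0)

# Mean 5-fold cross-validated R^2 for each model
tree_r2 = cross_val_score(single_tree, X_demo, y_demo, cv=5, scoring="r2").mean()
forest_r2 = cross_val_score(forest, X_demo, y_demo, cv=5, scoring="r2").mean()

print(f"Single tree mean CV R^2:   {tree_r2:.3f}")
print(f"Random forest mean CV R^2: {forest_r2:.3f}")
```

On noisy data like this, the averaged forest would typically be expected to score higher than the single tree, though the size of the gap depends on the dataset.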

**The Algorithm**
* We randomly select some samples from the original dataset (replacement is allowed) and we create a bootstrapped dataset that is the same size as the original dataset. Since replacement is allowed, we might get the same observation 2+ times in the training set (which is fine)
* Using this bootstrapped dataset we create a regression tree (using the techniques previously used for decision trees). In doing so we also use a random subset of features <u>at each step</u> of building the tree
* We repeat this process and build $H$ regression trees. The number of trees, $H$, is a hyperparameter that can be tuned.
* Because not every tree sees all observations, and not every split considers all features, the trees are 'de-correlated' and thus our model is less prone to overfitting
* Ultimately, the idea is that we create many regression trees, $\hat{f}^1 (x), \hat{f}^2 (x), \hat{f}^3 (x), \dots, \hat{f}^H (x)$, using a total of $H$ bootstrapped datasets and performing each node split by considering only a subset of the features.
* Once all $H$ trees have been built, we average the predictions from all of the trees, so that

$$\hat{f}_{\text{final prediction}} (x) = \frac{1}{H}\sum_{h = 1} ^H \hat{f}^{h} (x)$$

becomes our final prediction. This process of bootstrapping and then aggregating is sometimes referred to as [bagging](https://en.wikipedia.org/wiki/Bootstrap_aggregating). Considering only a subset of the predictors at each potential node split is what makes this a random forest rather than bagging on its own.
* We often allow each tree $\hat{f}^{h} (x)$ to grow very deep, which reduces the bias of the individual tree so that it can capture the true relationship. Each deep tree is a high-variance estimate, but averaging reduces the variance, and more so the less correlated the trees are
* It is not uncommon to use hundreds, if not thousands, of trees, though this comes at the expense of computation speed, and the error might not necessarily go down significantly beyond a few hundred trees
* Once we have built our model, we can also ascertain which variables played the largest role across all our trees in reducing the residual sum of squares, using the 'feature importance' functionality in scikit-learn. In this sense, we can even use random forests to help with feature selection for other models
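
The steps above can be sketched by hand. This is a simplified illustration, not how scikit-learn's `RandomForestRegressor` is implemented internally. The helper names `fit_forest` and `predict_forest`, the synthetic data and the choice of `max_features="sqrt"` are all assumptions made for the example. Each tree is a scikit-learn `DecisionTreeRegressor` fitted on a bootstrap sample, and the final prediction is the average over the $H$ trees, as in the formula above.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor


def fit_forest(X, y, n_trees=50, max_features="sqrt", seed=0):
    """Fit n_trees regression trees, each on its own bootstrap sample (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    n_obs = X.shape[0]
    trees = []
    for _ in range(n_trees):
        # Bootstrap: draw n_obs row indices with replacement
        idx = rng.integers(0, n_obs, size=n_obs)
        # max_features limits the features considered at each split
        tree = DecisionTreeRegressor(
            max_features=max_features,
            random_state=int(rng.integers(0, 2**31 - 1)),
        )
        tree.fit(X[idx], y[idx])
        trees.append(tree)
    return trees


def predict_forest(trees, X):
    """Average the predictions of all trees (hypothetical helper)."""
    return np.mean([tree.predict(X) for tree in trees], axis=0)


# Synthetic data: two informative features and two pure-noise features
rng = np.random.default_rng(42)
X_demo = rng.uniform(-3, 3, size=(300, 4))
y_demo = np.sin(X_demo[:, 0]) + 0.5 * X_demo[:, 1] + rng.normal(0, 0.3, size=300)

forest = fit_forest(X_demo, y_demo, n_trees=50)
y_hat = predict_forest(forest, X_demo)

# Average each tree's impurity-based importances across the forest
importances = np.mean([tree.feature_importances_ for tree in forest], axis=0)
print("Averaged feature importances:", np.round(importances, 3))
```

In this sketch the two informative features should receive most of the importance, which is the kind of signal one might use when screening features for other models.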


**Hyperparameters to Tune**
* The hyperparameters for random forest regression are the same as for decision tree regression, but now we also have access to an `n_estimators` hyperparameter, which is the number of trees, $H$, built for each hyperparameter combination that is tested
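
As a rough illustration of how `n_estimators` sits alongside the tree-level hyperparameters, the sketch below fits forests of different sizes and records the out-of-bag $R^2$ (`oob_score_`), which scikit-learn computes from the observations each tree did not see in its bootstrap sample. The synthetic data and the specific values tried are assumptions for the example, not the settings used in this notebook.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data (illustrative assumption)
rng = np.random.default_rng(0)
X_demo = rng.uniform(-3, 3, size=(400, 5))
y_demo = np.sin(X_demo[:, 0]) + 0.5 * X_demo[:, 1] + rng.normal(0, 0.3, size=400)

oob_r2 = {}
for n_trees in [25, 100, 300]:
    model = RandomForestRegressor(
        n_estimators=n_trees,   # H, the number of trees
        max_depth=None,         # same tree-level knobs as a single decision tree
        min_samples_split=2,
        min_samples_leaf=1,
        max_features=1.0,       # fraction of features considered at each split
        oob_score=True,         # score each tree on its out-of-bag observations
        random_state=0,
    )
    model.fit(X_demo, y_demo)
    oob_r2[n_trees] = model.oob_score_

for n_trees, score in oob_r2.items():
    print(f"n_estimators={n_trees:>3}: OOB R^2 = {score:.3f}")
```

Comparing the scores gives a cheap indication of whether adding more trees is still helping before committing to a larger search.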


**Potential Disadvantages**
* Because the random forest algorithm builds $H$ bootstrapped datasets and averages the predictions of all the trees, when $H$ gets large one can no longer visualise all the trees, and our ability to interpret the model deteriorates. That said, random forests typically perform much better on out-of-sample data than a single decision tree
* One other disadvantage, shared with decision trees, is that random forests can't extrapolate: each prediction is an average of training targets, so it cannot fall outside the range of target values seen during training
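
The extrapolation limitation can be seen in a small sketch. The linear toy data below is an assumption made purely for illustration: the forest is trained on $x \in [0, 10]$ with $y = 2x$, then asked to predict well outside that range.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Toy data: a perfectly linear relationship on a limited range
x_seen = np.linspace(0, 10, 200).reshape(-1, 1)
y_seen = 2.0 * x_seen.ravel()

model = RandomForestRegressor(n_estimators=50, random_state=0)
model.fit(x_seen, y_seen)

# One point inside the training range, two far outside it
x_new = np.array([[5.0], [20.0], [100.0]])
preds = model.predict(x_new)

print("Largest target seen in training:", y_seen.max())
for x_val, pred in zip(x_new.ravel(), preds):
    print(f"x = {x_val:>5.1f} -> prediction = {pred:.2f} (true value would be {2 * x_val:.1f})")
```

Predictions for inputs beyond the training range stay capped near the largest target seen in training, however far the input moves away.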


In [1]:
import joblib
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.ensemble import RandomForestRegressor

# Personal display settings
#===========================

# Suppress scientific notation
np.set_printoptions(suppress=True)

# Get dataset values showing only 2dp
pd.options.display.float_format = '{:.2f}'.format
pd.set_option('display.max_colwidth', None)

# For clear plots with a nice background
# Note: matplotlib 3.6+ renamed this style to 'seaborn-v0_8-whitegrid'
plt.style.use('seaborn-whitegrid')
%matplotlib inline

%load_ext autoreload
%autoreload 2

# python files
import data_prep
import helper_funcs

In [2]:
train = pd.read_csv('../datasets/train_updated.csv')
test = pd.read_csv('../datasets/test_updated.csv')

# Split data
to_drop = ['Country', 'HDI', 'Life_exp']

X_train = train.drop(to_drop, axis='columns')
X_test = test.drop(to_drop, axis='columns')

y_train = train['Life_exp']
y_test = test['Life_exp']

In [3]:
pipe = data_prep.create_pipeline(RandomForestRegressor())
pipe

Pipeline(steps=[('preprocessor',
                 ColumnTransformer(remainder='passthrough',
                                   transformers=[('numeric',
                                                  Pipeline(steps=[('identity',
                                                                   FunctionTransformer())]),
                                                  ['GDP_cap']),
                                                 ('categorical',
                                                  Pipeline(steps=[('ohe',
                                                                   OneHotEncoder(drop='first'))]),
                                                  ['Status'])])),
                ('imputation', KNNImputer()),
                ('model', RandomForestRegressor())])

In [4]:
param_grid = {
    'imputation__n_neighbors': np.arange(3, 21, 2), 
    'imputation__weights': ['uniform', 'distance'], 
    'model__n_estimators': np.arange(10, 200, 1),  # lots of trees
    'model__min_samples_split': np.arange(10, 40, 1),
    'model__min_samples_leaf': np.arange(3, 30, 1),
    'model__max_depth': np.arange(3, 12, 1),
    'model__criterion': ["squared_error", "absolute_error"]  # "mse" was renamed "squared_error" in scikit-learn 1.0
}

tuned_model = data_prep.randomised_search_wrapper(X_train,
                                                  y_train,
                                                  pipe, 
                                                  param_grid, 
                                                  cv=10,
                                                  n_iter=50)

Best Parameters were...
model__n_estimators had optimal value as: 187
model__min_samples_split had optimal value as: 15
model__min_samples_leaf had optimal value as: 7
model__max_depth had optimal value as: 9
model__criterion had optimal value as: absolute_error
imputation__weights had optimal value as: uniform
imputation__n_neighbors had optimal value as: 7

The fitted model just initialised now has all these parameters set up
