# 4. Ensembles

In [66]:
import numpy as np
import pandas as pd
import pymc3 as pm
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor, StackingRegressor
from california_data_pipeline import load_train_test
from bayesian_linear_regressor import BayesianLinearRegression

## Ensemble intro

### Wisdom of the crowd

If you asked a large group of people to estimate how many marbles were in a jar, you would get a wide range of answers. An expert at marble guessing may be in the top 10% of closest guesses, but they may still be far off.

However, if you had access to all these guesses and simply took their median, your estimate would likely be closer than the expert's.

This phenomenon is known as the wisdom of the crowd: the aggregate of many guesses is often better than almost any individual guess.
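As a quick illustration, the marble jar can be simulated. This is only a sketch: the true count of 500, the 1000 guessers and the log-normal noise on each guess are assumptions made up for the example, and the guesses are assumed to be independent and unbiased on the log scale.

```python
import numpy as np

rng = np.random.default_rng(0)
true_count = 500  # hypothetical number of marbles in the jar

# Each guess is the true count distorted by independent multiplicative noise
guesses = true_count * rng.lognormal(mean=0.0, sigma=0.5, size=1000)

individual_errors = np.abs(guesses - true_count)
crowd_error = abs(np.median(guesses) - true_count)

# Fraction of individuals whose guess is further off than the crowd median
beaten = np.mean(individual_errors > crowd_error)
print(f"median guess: {np.median(guesses):.0f}")
print(f"crowd median beats {beaten:.0%} of individual guessers")
```

If the guesses shared a common bias the median would inherit it, so the effect relies on the errors being reasonably independent.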

### Ensemble methods

Take the above example, but replace marble counters with machine learning models. The predictions of these models can be combined, by voting for classification or by averaging for regression. Such an ensemble will often outperform its individual components.

A weak learner is a classifier which performs just better than random guessing. If you created an ensemble of classifiers each with 51% accuracy, then given enough weak learners whose errors are sufficiently independent, the ensemble can perform very well, becoming a strong learner.
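The effect of adding weak learners can be worked out directly under one strong assumption: that the classifiers make fully independent errors. In that case the number of correct votes follows a binomial distribution, and the accuracy of a majority vote is the probability that more than half of the votes are correct. The helper names below are my own, and real models trained on the same data are correlated, so this is an upper bound on what to expect rather than a prediction.

```python
from math import exp, lgamma, log


def log_binom_pmf(k: int, n: int, p: float) -> float:
    """Log of the binomial probability of exactly k successes out of n."""
    return (lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
            + k * log(p) + (n - k) * log(1 - p))


def majority_vote_accuracy(n_learners: int, p: float) -> float:
    """Probability that a strict majority of independent learners is correct."""
    k_min = n_learners // 2 + 1
    return sum(exp(log_binom_pmf(k, n_learners, p))
               for k in range(k_min, n_learners + 1))


for n in (1, 101, 1001, 10001):
    print(f"{n:6d} learners -> majority vote accuracy {majority_vote_accuracy(n, 0.51):.3f}")
```

The probabilities are computed in log space because the binomial coefficients for thousands of learners are too large to handle as floats.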

### Bagging and pasting

One way to create an ensemble is to train the same type of model on multiple subsets of the data. The individual results of each model are then combined to get a result which is usually better than that of the best performing individual model.

When samples are drawn from the dataset without replacement this is known as pasting; when they are drawn with replacement, bagging is being performed. If pasting is used to give each model its own disjoint subset, each model can learn individual patterns as there is no shared data, but a lot of data is needed for this to work well, as you are effectively dividing the amount of data each model learns from by the number of models in the ensemble.

One benefit of bagging is that, because each subset is drawn with replacement, each model typically does not see every data point in the training set. These unseen data points are known as out-of-bag samples and can be used to validate the models without the need for a separate validation set, which means more data can be used to train the model.

A related technique involves selecting a random subset of the features to use for each round of training. The idea is that removing some of the features during training reduces over-reliance on a few features and makes the model learn a more generalised pattern from the data. This is very similar to how dropout can help reduce overfitting in neural networks.
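The following sketch shows how many points a bootstrap sample leaves out, and how Scikit-learn exposes bagging, pasting and the out-of-bag score through `BaggingRegressor`. It runs on synthetic data from `make_regression` rather than the housing data, and the sample sizes and estimator counts are arbitrary choices for the example.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor

# Drawing n indices with replacement leaves roughly a third of the points unseen
rng = np.random.default_rng(0)
n_samples = 10_000
bootstrap_idx = rng.integers(0, n_samples, size=n_samples)
oob_fraction = 1 - np.unique(bootstrap_idx).size / n_samples
print(f"out-of-bag fraction: {oob_fraction:.3f}")

X, y = make_regression(n_samples=500, n_features=5, noise=10.0, random_state=0)

# Bagging: sample with replacement, validate on the out-of-bag points
bagging = BaggingRegressor(DecisionTreeRegressor(), n_estimators=50,
                           bootstrap=True, oob_score=True, random_state=0)
bagging.fit(X, y)
print(f"bagging OOB R2: {bagging.oob_score_:.3f}")

# Pasting: sample without replacement, here half of the data per tree
pasting = BaggingRegressor(DecisionTreeRegressor(), n_estimators=50,
                           bootstrap=False, max_samples=0.5, random_state=0)
pasting.fit(X, y)
print(f"pasting training R2: {pasting.score(X, y):.3f}")
```

The out-of-bag score is only available when `bootstrap=True`, since pasting as configured here has no bootstrap-style held-out points tracked by the estimator.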

## 4.1 Random Forest

### California housing data

We shall be using the California housing dataset we used with Bayesian regression.

As we have already performed our analysis on the data and selected and transformed our features we shall replicate the same data pipeline from the previous notebook.

To do this I have implemented a transformation pipeline module which fetches the California housing data, transforms and creates features as described previously, and then returns the train and test features and target values. The function has optional parameters for test size and transformation type; see the module for details.

In [3]:
X_train, X_test, y_train, y_test = load_train_test()

In [4]:
X_train.head()

| | MedInc | AveOccup | AveBedrmsPerRoom | AveAddRooms | EstHouses | DistToTown |
|---|---|---|---|---|---|---|
| 15536 | -1.114001 | -0.813383 | 1.6164 | -1.473897 | 0.796013 | -0.59287 |
| 12143 | -0.914646 | 1.260378 | -0.905558 | 1.550903 | -1.113609 | 0.49775 |
| 9669 | -0.25858 | 2.651988 | 1.785041 | -2.053961 | -1.649685 | -1.150917 |
| 3005 | -0.534559 | -0.116728 | 0.497786 | -0.324815 | 0.074911 | 2.346136 |
| 17756 | -0.165095 | -0.325182 | -0.864095 | 0.789426 | 0.039902 | 1.489587 |


In [5]:
y_train.head()

| | MedHouseVal |
|---|---|
| 15536 | 0.253606 |
| 12143 | -1.014571 |
| 9669 | -0.24576 |
| 3005 | -1.446104 |
| 17756 | -0.792733 |


### Decision trees

A decision tree performs classification or regression based on a flow-chart-like graph, where an input starts at the root and makes its way through the tree based on a condition on one feature of the data point at each node. At the bottom of the tree is the output prediction for the input features. This is best demonstrated with a diagram:

>>> INSERT DECISION TREE

Decision tree models are trained through supervised learning. At each node the tree finds the split of the training data that best separates the target values; an example might be splitting houses into those less than 10km from LA and those which are not. The combination of these feature splits can result in very complex models being learned.
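To make the idea of an optimal split concrete, here is a minimal sketch of the search a regression tree performs for a single feature at a single node: try every midpoint between neighbouring values and keep the one that minimises the total squared error of the two resulting groups. The function `best_split` and the toy distances and prices are invented for the example; a real implementation such as Scikit-learn's repeats this over all features and is far more optimised.

```python
import numpy as np


def best_split(x: np.ndarray, y: np.ndarray):
    """Return the threshold on one feature minimising the children's total squared error."""
    order = np.argsort(x)
    x_sorted, y_sorted = x[order], y[order]
    best_threshold, best_cost = None, np.inf
    for i in range(1, len(x_sorted)):
        if x_sorted[i] == x_sorted[i - 1]:
            continue  # no threshold can separate equal values
        left, right = y_sorted[:i], y_sorted[i:]
        # Squared error of each child around its own mean prediction
        cost = left.var() * len(left) + right.var() * len(right)
        if cost < best_cost:
            best_threshold = (x_sorted[i - 1] + x_sorted[i]) / 2
            best_cost = cost
    return best_threshold, best_cost


# Toy data: hypothetical distances (km) and prices with a jump around 10km
dist = np.array([2.0, 4.0, 6.0, 8.0, 12.0, 14.0, 16.0, 18.0])
price = np.array([5.0, 5.2, 4.9, 5.1, 2.0, 2.1, 1.9, 2.2])

threshold, cost = best_split(dist, price)
print(f"best threshold: {threshold} km")
```

On this toy data the search places the threshold midway between the 8km and 12km houses, which is the 10km split described above.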

We shall train a simple decision tree regressor with the preprocessed California housing data we just loaded in.

In [58]:
tree_reg = DecisionTreeRegressor()
tree_reg.fit(X_train, y_train)

DecisionTreeRegressor()

We can then make predictions of the test set values and get scores for R2, MSE, RMSE.

In [59]:
# print R2, MSE and RMSE for a fitted regressor on the given data
def reg_metrics(reg, reg_name, X, y):
    reg_preds = reg.predict(X)
    reg_r2 = r2_score(y, reg_preds)
    reg_mse = mean_squared_error(y, reg_preds)
    # RMSE as the square root of MSE; the `squared` argument is not available in newer scikit-learn releases
    reg_rmse = np.sqrt(reg_mse)
    print(f"{reg_name} regression R2:   {reg_r2:.4f}")
    print(f"{reg_name} regression MSE:  {reg_mse:.4f}")
    print(f"{reg_name} regression RMSE: {reg_rmse:.4f}")

In [60]:
reg_metrics(tree_reg, "decision tree", X_test, y_test)

decision tree regression R2:   0.4151
decision tree regression MSE:  0.5978
decision tree regression RMSE: 0.7732


As we have used the same data and the same preprocessing (albeit with a different random train/test split), we can make reasonably fair comparisons between the models we have trained on the data.

Looking at the metrics above we can see our decision tree has done reasonably well with an R2 of 0.4151, although it is not as good a fit as our Bayesian linear regression model.

Let's see how we can apply ensemble methods to produce even better results.

### Random forest

A random forest is an ensemble of decision trees. Using bagging, a number of decision trees are each trained on a subset of the training data. The results from each tree are then aggregated, producing a result that is usually far better than that of any single tree.

Random forests also implement the random feature selection mentioned previously. Instead of searching all features for the best split at each node, a tree considers only a random subset of the features and finds the best split within it. This allows more features to be utilised by the model and deeper patterns to be learned than without randomising the feature sets.
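The aggregation step can be checked by hand. The sketch below, on synthetic data rather than the housing data, fits a small forest and averages the predictions of the individual trees stored in `estimators_`; the result should match what the forest itself predicts. The choice of `max_features=0.5` is only there to switch on feature subsampling for the example.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=300, n_features=6, noise=5.0, random_state=0)

# Each tree sees a bootstrap sample and half of the features at every split
forest = RandomForestRegressor(n_estimators=10, max_features=0.5, random_state=0)
forest.fit(X, y)

# One row of predictions per fitted tree, then the mean across trees
tree_preds = np.stack([tree.predict(X) for tree in forest.estimators_])
manual_mean = tree_preds.mean(axis=0)

matches = np.allclose(manual_mean, forest.predict(X))
print("forest prediction equals mean of tree predictions:", matches)
print("spread between trees (mean std):", tree_preds.std(axis=0).mean().round(3))
```

The spread between trees is a rough indication of how much the individual trees disagree, which is exactly what averaging smooths out.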

As above we shall train on the training data and analyse the results on the test set, but this time with an ensemble of 10 decision trees.

In [61]:
forest_reg = RandomForestRegressor(n_estimators=10)
forest_reg.fit(X_train, np.ravel(y_train))

RandomForestRegressor(n_estimators=10)

In [62]:
reg_metrics(forest_reg, "random forest", X_test, y_test)

random forest regression R2:   0.6714
random forest regression MSE:  0.3358
random forest regression RMSE: 0.5795


With just 10 trees, we have managed to increase R2 and decrease both MSE and RMSE significantly.

Let's try increasing the number of estimators to further benefit from the wisdom of the crowd.

In [63]:
forest_reg = RandomForestRegressor(n_estimators=100)
forest_reg.fit(X_train, np.ravel(y_train))
reg_metrics(forest_reg, "random forest", X_test, y_test)

random forest regression R2:   0.6994
random forest regression MSE:  0.3073
random forest regression RMSE: 0.5543


Let's check how well it fits the data it was trained on to see if we are overfitting.

In [None]:
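# get scores for the training set to compare against the test scores above
reg_metrics(forest_reg, "random forest (train)", X_train, y_train)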

It seems like the model has overfit and not generalised to the test set as well as it could have. Increasing the number of estimators and constraining the tree depth should force the ensemble to generalise more as it learns more important patterns.

In [64]:
forest_reg = RandomForestRegressor(n_estimators=1000, max_depth=10, n_jobs=-1)
forest_reg.fit(X_train, np.ravel(y_train))
reg_metrics(forest_reg, "random forest", X_test, y_test)

random forest regression R2:   0.7000
random forest regression MSE:  0.3066
random forest regression RMSE: 0.5538


As expected, constraining each model and forcing each to learn stronger patterns helps the ensemble generalise to the test set, although the gain here is small (R2 rises from 0.6994 to 0.7000).

We shall run a random hyperparameter search over some different parameters to get a better idea of the optimal tree constraints.

Random forest default parameters:
- n_estimators=100
- max_depth=None
- min_samples_split=2
- min_samples_leaf=1
- min_weight_fraction_leaf=0.0
- max_features='auto'
- max_leaf_nodes=None
- min_impurity_decrease=0.0
- min_impurity_split=None
- bootstrap=True
- ccp_alpha=0.0
- max_samples=None

In [81]:
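# candidate values for the random search
n_estimators = np.arange(100, 5000)
# 1.0 uses all features (what 'auto' meant for regressors in older scikit-learn releases)
max_features = [1.0, 'sqrt', 'log2']
max_depth = np.arange(2, 100)
# min_samples_split must be at least 2 when given as an integer
min_samples_split = np.arange(2, 100)
min_samples_leaf = np.arange(1, 100)
bootstrap = [True, False]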

forest_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               'bootstrap': bootstrap}

forest_reg = RandomForestRegressor()

forest_randomcv = RandomizedSearchCV(forest_reg, forest_grid, n_iter=100, cv=3)

In [None]:
forest_randomcv.fit(X_train, np.ravel(y_train))

In [None]:
# joblib is imported directly; sklearn.externals.joblib has been removed from scikit-learn
import joblib
joblib.dump(forest_randomcv, 'forest_randomcv.pkl')

- discuss results from search
- decide on final parameters

### Limiting branches and features

As discussed above limiting the 

### Trees in a forest

- Plot the trade off between training time and performance
- What is the optimal (wrt. training time and accuracy) number of trees

In [None]:
# loop training optimal model with varying number of trees
# record training time for each model
# plot training time (number of trees) vs oob error
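A possible shape for this experiment is sketched below. It is not the final analysis: it uses synthetic data from `make_regression` and default tree parameters as stand-ins, since the optimal parameters from the search have not been decided yet, and the list of tree counts is an arbitrary choice. The out-of-bag error is taken as one minus the out-of-bag R2.

```python
import time

from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in for the housing data
X, y = make_regression(n_samples=1000, n_features=6, noise=10.0, random_state=0)

tree_counts = [25, 50, 100, 200]
fit_times, oob_errors = [], []

for n_trees in tree_counts:
    forest = RandomForestRegressor(n_estimators=n_trees, oob_score=True, random_state=0)
    start = time.perf_counter()
    forest.fit(X, y)
    fit_times.append(time.perf_counter() - start)
    oob_errors.append(1 - forest.oob_score_)  # 1 - OOB R2

for n_trees, fit_time, oob_error in zip(tree_counts, fit_times, oob_errors):
    print(f"{n_trees:4d} trees  fit time {fit_time:.2f}s  OOB error {oob_error:.4f}")
```

The two lists can then be passed to `plt.plot` against `tree_counts` to see where the error curve flattens while training time keeps growing.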

### Extremely randomised trees

Also called extra trees, extremely randomised trees are an extension of the random forest. Where random forests select the best split from a random subset of features, extra trees further increase randomness by also drawing a random threshold for each candidate feature and picking the best of those.

This can produce models which learn more generalised patterns, but their main benefit is that they are usually cheaper to train than random forests, as no search for the optimal threshold is needed.
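The difference between the two methods is visible in the fitted Scikit-learn objects. In this small sketch on synthetic data, the trees inside each ensemble are inspected for their `splitter` setting, which controls whether thresholds are searched for or drawn at random. The dataset and the number of estimators are arbitrary choices for the example.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import ExtraTreesRegressor, RandomForestRegressor

X, y = make_regression(n_samples=500, n_features=6, noise=10.0, random_state=0)

forest = RandomForestRegressor(n_estimators=20, random_state=0).fit(X, y)
extra = ExtraTreesRegressor(n_estimators=20, random_state=0).fit(X, y)

# The splitter of the underlying trees shows where the two methods differ
forest_splitter = forest.estimators_[0].splitter
extra_splitter = extra.estimators_[0].splitter
print("random forest trees use splitter:", forest_splitter)
print("extra trees use splitter:        ", extra_splitter)
```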

Let's create an extra trees regressor and see how it performs on the California housing data.

- Additional check of extra trees vs random forest
    - use opt forest params for both
    - or quick grid search for extra trees

In [None]:
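extra_reg = ExtraTreesRegressor()
# ravel the target to a 1D array, as for the random forest above
extra_reg.fit(X_train, np.ravel(y_train))
reg_metrics(extra_reg, "extra trees", X_test, y_test)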

Discuss results compared to random forest.

### Interpreting a forest

- Easier to look at a decision tree but still too large to comprehend
    - nothing to learn from looking at the tree that the model hasn't learnt already
- Decision trees are weak learners
    - small change in input can lead to different trees
    - look at `.feature_importances_` for a few subsets
- Random forests are much stronger learners
    - look at `.feature_importances_` for same subsets
    - shouldn't change as much as decision tree
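One way the comparison in these notes could be run is sketched below. It uses synthetic data and a hypothetical helper, `importances_over_subsets`, which fits a fresh model on several random subsets and collects `feature_importances_` from each; the subset count and size are arbitrary. The standard deviation per feature across subsets then gives a rough measure of how stable the importances are.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=600, n_features=6, n_informative=3,
                       noise=10.0, random_state=0)
rng = np.random.default_rng(0)


def importances_over_subsets(make_model, n_subsets=5, subset_size=300):
    """Fit a fresh model on several random subsets and stack the importances."""
    rows = []
    for _ in range(n_subsets):
        idx = rng.choice(len(X), size=subset_size, replace=False)
        model = make_model().fit(X[idx], y[idx])
        rows.append(model.feature_importances_)
    return np.array(rows)


tree_imp = importances_over_subsets(lambda: DecisionTreeRegressor(random_state=0))
forest_imp = importances_over_subsets(
    lambda: RandomForestRegressor(n_estimators=50, random_state=0))

print("tree   importance std per feature:", tree_imp.std(axis=0).round(3))
print("forest importance std per feature:", forest_imp.std(axis=0).round(3))
```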

## 4.2 Stacking

### Stacking

Stacking takes the idea of ensemble learning further by replacing simple model aggregation methods such as voting and averaging with another model which is trained on each model's predictions. Each model in the prediction ensemble passes its output to a blending model, which outputs the final result.
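The mechanics can be written out by hand in a few lines. In this sketch the base models are a random forest and a ridge regression, the blender is a plain linear regression, and the data is synthetic; all of these are choices made for the example rather than the models used later. The blender is trained on out-of-fold predictions so that it never sees a base model's prediction for a point that model was trained on.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import cross_val_predict, train_test_split

X, y = make_regression(n_samples=600, n_features=6, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

base_models = [RandomForestRegressor(n_estimators=50, random_state=0), Ridge()]

# Out-of-fold predictions of each base model become the blender's training features
meta_train = np.column_stack(
    [cross_val_predict(model, X_tr, y_tr, cv=5) for model in base_models])
blender = LinearRegression().fit(meta_train, y_tr)

# For prediction the base models are refit on all of the training data
meta_test = np.column_stack(
    [model.fit(X_tr, y_tr).predict(X_te) for model in base_models])
stacked_r2 = blender.score(meta_test, y_te)

print("blender weights per base model:", blender.coef_.round(3))
print(f"stacked R2 on the test split: {stacked_r2:.3f}")
```

Scikit-learn's `StackingRegressor` wraps this same pattern of cross-validated base predictions and a final estimator behind a single `fit` call.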

>>> Stacking diagram

### Stacking random forests

- Stack random forests together
- Compare to optimal random forest
- Create ensemble of random forests and take mean of results
- Compare to stacked and normal random forest

### Multi-layer stacking

The idea of stacking can be extended further, allowing multiple layers of stacked models on top of each other. The layers can also contain different types of models.

I shall use this method to stack together a random forest regressor and a Bayesian linear regressor. The idea is that the two varied approaches to regression come together for a stronger solution.

### Bayesian linear regression

In the previous task I created a Scikit-learn wrapper for the PyMC3 Bayesian linear regression model I built for the California housing data. I shall use this in my stacking ensemble.

In [3]:
from bayesian_linear_regressor import BayesianLinearRegression

bayesian_reg = BayesianLinearRegression()

Note: as the Bayesian linear regression model produces a probability distribution rather than a single estimate, the predict method of the class predicts using the mean of the samples from the distribution. This means the class can return an R2 score, allowing inheritance from Scikit-learn's `base.RegressorMixin` and stacking with `StackingRegressor`.
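To show the pattern without PyMC3, here is a hypothetical stand-in class. It is not the `BayesianLinearRegression` wrapper from the previous task: instead of MCMC it draws coefficient samples from a normal approximation around the least-squares fit, which is an assumption made purely so the example runs on its own. What it shares with the wrapper is the shape of the interface: `fit` stores samples, `predict` returns the mean prediction over those samples, and `RegressorMixin` supplies `score`.

```python
import numpy as np
from sklearn.base import BaseEstimator, RegressorMixin
from sklearn.datasets import make_regression


class PosteriorMeanRegressor(RegressorMixin, BaseEstimator):
    """Hypothetical stand-in for a sampling-based regressor (not the PyMC3 wrapper)."""

    def __init__(self, n_samples=500, random_state=0):
        self.n_samples = n_samples
        self.random_state = random_state

    def fit(self, X, y):
        X, y = np.asarray(X, dtype=float), np.ravel(y)
        Xb = np.column_stack([np.ones(len(X)), X])  # add intercept column
        # Stand-in for MCMC: sample coefficients around the least-squares fit
        beta_hat, *_ = np.linalg.lstsq(Xb, y, rcond=None)
        dof = max(len(y) - Xb.shape[1], 1)
        resid_var = np.sum((y - Xb @ beta_hat) ** 2) / dof
        cov = resid_var * np.linalg.inv(Xb.T @ Xb)
        rng = np.random.default_rng(self.random_state)
        self.coef_samples_ = rng.multivariate_normal(beta_hat, cov, size=self.n_samples)
        return self

    def predict(self, X):
        Xb = np.column_stack([np.ones(len(X)), np.asarray(X, dtype=float)])
        # One prediction per coefficient sample, collapsed to the mean
        return (Xb @ self.coef_samples_.T).mean(axis=1)


X, y = make_regression(n_samples=200, n_features=4, noise=5.0, random_state=0)
model = PosteriorMeanRegressor().fit(X, y)
print(f"R2 from RegressorMixin.score: {model.score(X, y):.3f}")
```

Listing `RegressorMixin` before `BaseEstimator` follows the ordering Scikit-learn recommends for mixins, which matters for estimator type detection in recent releases.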

### Stacking tasks

Bayesian linear regression and decision trees are two very different approaches to regression. Ensemble methods can exploit such diversity between different methods to improve performance. So now you will try combining the random forest and Bayesian linear regression using stacking.

Scikit-learn includes the StackingRegressor class to help you with this. In the report, explain the stacking approach and describe your results, making sure to cover the following points:

1. When does stacking improve performance over the individual models (e.g. try stacking with a random forest with `max_depth=10` and `n_estimators=10`)?
2. What happens if we just take the mean prediction from our base models instead?
3. Use a DecisionTreeRegressor as the final estimator and visualise the tree to understand what stacking is doing.
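A starting point for these tasks might look like the sketch below. It makes several substitutions that should be replaced for the real experiment: synthetic data instead of the housing data, and Scikit-learn's `BayesianRidge` in place of my `BayesianLinearRegression` wrapper. `VotingRegressor` gives the plain mean of the base models, `StackingRegressor` trains a shallow `DecisionTreeRegressor` on their predictions, and `export_text` prints that tree so the blending rules can be read. The feature names passed to `export_text` are labels I chose for the two base model predictions.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor, StackingRegressor, VotingRegressor
from sklearn.linear_model import BayesianRidge
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor, export_text

X, y = make_regression(n_samples=600, n_features=6, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)


def base_estimators():
    """Fresh copies of the two base models for each ensemble."""
    return [
        ("forest", RandomForestRegressor(n_estimators=10, max_depth=10, random_state=0)),
        ("bayes", BayesianRidge()),  # stand-in for the PyMC3 wrapper
    ]


# Task 2: simply average the base model predictions
mean_reg = VotingRegressor(base_estimators()).fit(X_tr, y_tr)

# Task 3: let a shallow decision tree learn how to combine them
stack_reg = StackingRegressor(
    base_estimators(),
    final_estimator=DecisionTreeRegressor(max_depth=2, random_state=0),
    cv=5,
).fit(X_tr, y_tr)

print(f"mean of base models R2:    {mean_reg.score(X_te, y_te):.3f}")
print(f"stacked (tree blender) R2: {stack_reg.score(X_te, y_te):.3f}")
print(export_text(stack_reg.final_estimator_, feature_names=["forest_pred", "bayes_pred"]))
```

A depth-2 tree can only output four distinct values, so it is useful for seeing which base model the blender relies on but is unlikely to be the best final estimator.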

## Ensemble Conclusion

### Comparison of random forest and stacking

- Comparison of random forest and stacking methods
  - Accuracy
  - Computational complexity
  - Data needed
  - Generalisation / ease of use