## Chapter 11: Principles of Feature Learning

# 11.9 Bagging Cross-Validated Models


- The random nature of splitting data into training and validation sets exposes a potential weakness in our cross-validation process: <b><u>bad training-validation splits</u></b>.


- Such splits poorly represent the underlying phenomenon that generated the data, which can result in poorly representative cross-validated models.

- A practical solution to this fundamental problem is to simply perform several different training-validation splits, determine an appropriate cross-validated model on each split, and then *average* the resulting cross-validated models. This is called <b><u>bagging</u></b>.


- By averaging a set of cross-validated models we can *very often* 'average out' the potentially undesirable characteristics of each model while combining their positive attributes.
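
The splitting step described above can be sketched in a few lines. This is a minimal illustration in plain Python, not the code used to produce the figures in this section; the function name `random_splits` and its arguments are hypothetical.

```python
import random

def random_splits(num_points, num_splits, train_fraction=0.8, seed=0):
    """Return a list of (train_indices, valid_indices) pairs, one per random split."""
    rng = random.Random(seed)
    num_train = int(round(train_fraction * num_points))
    splits = []
    for _ in range(num_splits):
        # shuffle all point indices, then cut them into two disjoint parts
        indices = list(range(num_points))
        rng.shuffle(indices)
        train = sorted(indices[:num_train])
        valid = sorted(indices[num_train:])
        splits.append((train, valid))
    return splits

# ten random splits of a 20-point dataset, 4/5 training and 1/5 validation
splits = random_splits(num_points=20, num_splits=10)
```

A separate cross-validated model would then be tuned on each of these splits before the models are averaged.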

In [1]:
## This code cell will not be shown in the HTML version of this notebook
# imports from custom library
import sys
sys.path.append('../../')
import autograd.numpy as np
from mlrefined_libraries import nonlinear_superlearn_library as nonlib
datapath = '../../mlrefined_datasets/nonlinear_superlearn_datasets/'

# plotting tools
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D

# this is needed to compensate for %matplotlib notebook's tendency to blow up images when plotted inline
%matplotlib notebook
from matplotlib import rcParams
rcParams['figure.autolayout'] = True

%load_ext autoreload
%autoreload 2

## Bagging regression models

Generally the best way to bag (or average) cross-validated regression models is by taking their <b><u>median</u></b> (as opposed to their mean).

#### <span style="color:#a50e3e;">Example.</span>   Bagging cross-validated regression models

In the set of small panels on the left side of [Figure 11.47](#figure-11-47) we show $10$ different training-validation splits of a prototypical nonlinear regression dataset, where
$\frac{4}{5}$ of the data in each instance is used for training (colored light blue) and $\frac{1}{5}$ is used for validation (colored yellow).  Plotted with each split of the original data is the corresponding cross-validated spanning set model found by naively cross-validating (see [Section 11.4](https://jermwatt.github.io/machine_learning_refined/notes/11_Feature_learning/11_4_Cross_validation.html)) the range of complete polynomial models of degree $1$ to $20$.  As we can see, while *many* of these cross-validated models perform quite well, several of them severely *underfit* or *overfit* the original dataset.  In each instance the poor performance is due entirely to the particular underlying (random) training-validation split, which leads cross-validation to a tuned model that minimizes validation error yet still does not represent the true underlying phenomenon very well.  By taking an *average* (here the *median*) of the $10$ cross-validated models shown in these small panels we can average out the poor performance of this handful of bad models, leading to a final bagged model that fits the data quite well, as shown in the large right panel of [Figure 11.47](#figure-11-47).

---

<a id='figure-11-47'></a>
<figure>
<p>
  <img align="right" src= '../../mlrefined_images/nonlinear_superlearn_images/Figure_11_47.png' width="70%"  alt=""/>
</p>
<figcaption> <strong>Figure: 11.47 </strong> <em> (small panels) Ten different random training-validation splits of a nonlinear regression dataset (blue: training, yellow: validation), with the best cross-validated model drawn in each panel.<br><br> (large panel) The bagged (median) model of the $10$ models whose fits are shown on the left. 
</em>
</figcaption>
</figure>

More generally, if we create $E$ cross-validated regression models $\left\{\text{model}_{e}\left(\mathbf{x},\Theta_e^{\star}\right)\right\}_{e=1}^E$, each trained on a different training-validation split of the data, then our median model $\text{model}\left(\mathbf{x},\Theta^{\star}\right)$ is defined as 

\begin{equation}
    \text{model}\left(\mathbf{x},\Theta^{\star} \right) = \text{median}\left\{ \text{model}_{e}\left(\mathbf{x},\Theta_e^{\star}\right)\right\}_{e=1}^E.
\end{equation}

Note that the parameter set $\Theta^{\star}$ of the median model contains all of the tuned parameters $\left\{\Theta_e^{\star}\right\}_{e=1}^E$ from the models in the cross-validated set.
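
As a minimal sketch of the median model above, assume each tuned model is available as a Python function of its input. The three stand-in models below are made up purely for illustration (they are not fit to any data), and `bag_regression` is a hypothetical helper name.

```python
import statistics

def bag_regression(models):
    """Return the bagged model: the pointwise median of the given models."""
    def bagged_model(x):
        # evaluate every model at x and take the median of the E outputs
        return statistics.median(model(x) for model in models)
    return bagged_model

# stand-ins for E = 3 tuned models; the third plays the role of a bad fit
models = [
    lambda x: 2.0 * x,
    lambda x: 2.0 * x + 1.0,
    lambda x: 100.0,
]
bagged = bag_regression(models)
prediction = bagged(1.0)   # median of 2.0, 3.0, and 100.0
```

Here `prediction` is `3.0`, so the badly fitting third model has no influence on the output at this input. Because the bagged function must keep every individual model around in order to evaluate it, it mirrors the fact that $\Theta^{\star}$ contains all of the tuned parameter sets.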

---

### Why the median, not the mean?


Because, generally speaking, the mean is more sensitive to *outliers* than the median.
<br><br><br>
<figure>
    
<p>
  <img src= '../../mlrefined_images/nonlinear_superlearn_images/Figure_11_48_1.png' width="80%"  alt=""/>
</p>
</figure>
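
A small numeric illustration of the mean's sensitivity to outliers, using made-up predictions from five hypothetical models at a single input, one of which is badly off:

```python
import statistics

# made-up predictions of five models at one input; the last is an outlier
predictions = [1.0, 1.5, 0.5, 1.0, 11.0]

mean_prediction = statistics.mean(predictions)      # pulled toward the outlier
median_prediction = statistics.median(predictions)  # unaffected by its size

print(mean_prediction, median_prediction)
```

The single outlier moves the mean to `3.0`, while the median stays at `1.0`, in line with the other four predictions.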

### Bagging polynomials, neural networks, and trees together  


With bagging we can also effectively combine cross-validated models built from different universal approximators.

<br>
<figure>
    
<p>
  <img src= '../../mlrefined_images/nonlinear_superlearn_images/Figure_11_48_2.png' width="80%"  alt=""/>
</p>
</figure>

<a id='figure-11-48'></a>
<figure>
<p>
  <img src= '../../mlrefined_images/nonlinear_superlearn_images/Figure_11_48.png' width="90%"  alt=""/>
</p>
<figcaption> <strong>Figure: 11.48 </strong> <em> 
(top row) The $10$ individual cross-validated models first shown in the small panels of Figure 11.47, plotted together (left panel).  The median and mean of these models are shown in the middle and right panels, respectively.  With regression, bagging via the median tends to produce more trustworthy results as it is less sensitive to outliers. (bottom row) Cross-validated fixed-shape polynomial (left panel), neural network (second panel from the left), and tree-based (second panel from the right) models.  The median of these three models is shown in the right panel.  See text for further details.
</em>
</figcaption>
</figure>

---

When we bag we are simply averaging various cross-validated models with the desire to both avoid bad aspects of poorly performing models, and jointly leverage strong elements of the well performing ones.  Nothing in
this notion prevents us from bagging together cross-validated models built using different universal approximators, and indeed this is the most organized way of combining different types of universal approximators in practice.

In the bottom row of [Figure 11.48](#figure-11-48) we show a cross-validated polynomial model (left panel) built by naively cross-validating (see [Section 11.4](https://jermwatt.github.io/machine_learning_refined/notes/11_Feature_learning/11_4_Cross_validation.html)) full polynomials of degree $1$ through $10$, a naively cross-validated neural network model (second panel from the left) built by comparing models consisting of $1$ through $10$ units, and a cross-validated stump model (second panel from the right) built via boosting (see [Section 11.5](https://jermwatt.github.io/machine_learning_refined/notes/11_Feature_learning/11_5_Boosting.html)).  Each cross-validated model uses a different training-validation split of the original dataset, and the bag (median) of these models is shown in the right panel.

---

## Bagging classification models

Because the predicted output of a classification model is a *discrete* label, the average used to bag such cross-validated models is the <b><u>mode</u></b> (i.e., the
most popularly predicted label).
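
A minimal sketch of bagging by the mode, assuming each tuned classifier is available as a Python function returning a label. The three threshold classifiers below are made-up stand-ins, and `bag_classification` is a hypothetical helper name.

```python
from collections import Counter

def bag_classification(models):
    """Return the bagged model: the most popular label among the given models."""
    def bagged_model(x):
        votes = Counter(model(x) for model in models)
        # most_common(1) returns a one-element list holding (label, count)
        return votes.most_common(1)[0][0]
    return bagged_model

# stand-ins for three tuned two-class models with labels -1 and +1
models = [
    lambda x: 1 if x > 0.0 else -1,
    lambda x: 1 if x > 0.5 else -1,
    lambda x: 1 if x > -0.5 else -1,
]
bagged = bag_classification(models)
labels = [bagged(x) for x in (-1.0, -0.2, 0.2, 1.0)]
```

Here `labels` is `[-1, -1, 1, 1]`: at each input the bagged model follows the majority even where one model disagrees. With an odd number of two-class models no tie can occur; in other settings a tie-breaking rule is needed, and this sketch simply keeps whichever tied label was counted first.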

#### <span style="color:#a50e3e;">Example.</span>   Bagging cross-validated two-class classification models

In the set of small panels in the left column of [Figure 11.49](#figure-11-49) we show $5$ different training-validation splits of a prototypical two-class classification dataset, where $\frac{2}{3}$ of the data in each instance is used for training and $\frac{1}{3}$ is used for validation (the edges of the validation points are colored yellow).  Plotted with each split of the original data is the nonlinear decision boundary of the cross-validated model found by naively cross-validating the range of complete polynomial models of degree $1$ to $8$.  *Many* of these cross-validated models perform quite well, but several of them (due to the particular training-validation split on which they are based) severely *overfit* the original dataset.  By bagging these models using the most popular prediction to assign labels (i.e., the *mode* of these cross-validated model predictions) we produce an appropriate decision boundary for the data, shown in the right panel of the Figure.

---

<a id='figure-11-49'></a>
<figure>
<p>
  <img align="right" src= '../../mlrefined_images/nonlinear_superlearn_images/Figure_11_49.png' width="50%"  alt=""/>
</p>
<figcaption> <strong>Figure: 11.49 </strong> <em> 
(small panels) Five models cross-validated on random training-validation splits of the data, with the validation data in each instance highlighted with a yellow outline.<br><br> (large panel) The bagged (modal) model of the $5$ models shown on the left.   
</em>
</figcaption>
</figure>

---


In the top-middle panel of [Figure 11.50](#figure-11-50) we illustrate the decision boundaries of $5$ naively cross-validated models, each built using $B = 20$ single-layer $\text{tanh}$ units and trained on a different training-validation split
of the dataset shown in the top-left panel of the Figure.  In each instance $\frac{1}{3}$ of the dataset is randomly chosen as validation (highlighted in yellow), and each model's parameters are tuned via $\ell_2$ regularization based cross-validation (see [Section 11.6](https://jermwatt.github.io/machine_learning_refined/notes/11_Feature_learning/11_6_Regularization.html)) using a dense range of values for $\lambda \in [0,0.1]$.  The decision boundaries are plotted on top of the original dataset, each colored differently so they can be distinguished visually.  While some of them separate the two classes quite well, others do a poorer job.  In the top-right panel we show the decision boundary of the bag, created by taking the mode of the predictions from these cross-validated models, which performs quite well.

---

<a id='figure-11-50'></a>
<figure>
<p>
  <img src= '../../mlrefined_images/nonlinear_superlearn_images/Figure_11_50.png' width="90%"  alt=""/>
</p>
<figcaption> <strong>Figure: 11.50 </strong> <em> 
(top row) (middle panel) The decision boundaries, each shown in a different color, resulting from $5$ models cross-validated on different training-validation splits of the dataset shown in the left panel.  (right panel) The decision boundary resulting from the mode, the 'modal model,' of the $5$ cross-validated models whose decision boundaries are shown in the middle panel.  (bottom row) The decision boundaries provided by a cross-validated fixed-shape model (left panel), neural network model (second from the left panel), and tree-based model (second panel from the right).  In each instance the validation portion of the data is highlighted in yellow.  (right panel)  The decision boundary provided by the mode of these three models.  See text for further details.
</em>
</figcaption>
</figure>

---

As with regression, with classification we can also combine cross-validated models built from different universal approximators.  We illustrate this in the bottom row of [Figure 11.50](#figure-11-50) using the same dataset.  In particular we show a naively cross-validated polynomial model (left panel) built by comparing full polynomials of degree $1$ through $10$, a naively cross-validated neural network model (second panel from the left) built by comparing models consisting of $1$ through $10$ units, and a cross-validated stump model (second panel from the right) built via boosting over a range of $20$ units.  Each cross-validated model uses a different training-validation split of the original dataset (the validation portion is highlighted in yellow in each panel), and the bag (mode) of these models, shown in the right panel, performs quite well.

### Bagging different types of universal approximators


As with regression, with classification we can also combine cross-validated models built from different universal approximators.

<br>
<figure>
    
<p>
  <img src= '../../mlrefined_images/nonlinear_superlearn_images/Figure_11_50_2.png' width="80%"  alt=""/>
</p>
</figure>

#### <span style="color:#a50e3e;">Example.</span>   Bagging multi-class models

In this example we illustrate the bagging of various cross-validated multi-class models on two different datasets, shown in the left column of [Figure 11.51](#figure-11-51).  In each case we naively cross-validate full polynomial models of degree $1$ through $5$, with $5$ cross-validated models learned in total.  In the middle column of the Figure we show the decision boundaries provided by each cross-validated model in distinct colors, while the right column shows the decision boundary of the final modal model for each dataset, which performs very well in both cases.

---

<a id='figure-11-51'></a>
<figure>
<p>
  <img src= '../../mlrefined_images/nonlinear_superlearn_images/Figure_11_51.png' width="70%"  alt=""/>
</p>
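<figcaption> <strong>Figure: 11.51 </strong> <em> 
(left column) Two multi-class datasets.  (middle column) The decision boundaries of the $5$ cross-validated models, each shown in a distinct color.  (right column) The decision boundary of the final bagged (modal) model for each dataset.
</em>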
</figcaption>
</figure>

---

## How many models should we bag in practice?

- There is generally no magic number. 

- The smaller the dataset, the less we can trust a random validation portion of it to faithfully represent the underlying phenomenon that generated the data, and hence the more models we might wish to ensemble to help average out poorly performing models resulting from bad splits of the data.

- Usually practical considerations like computational power and dataset size determine whether bagging is used and, if so, how many models are employed in the average.

## Bagging vs. Boosting

- Bagging and boosting are both "ensembling" methods, as they are used to ensemble or combine different models to improve efficacy. 

- With <u><b>boosting</b></u> we build up a <u><b>single</b></u> cross-validated model by gradually <u><b>adding</b></u> together simple models consisting of a single universal approximator unit. Each of these units is trained in a way that makes it <u><b>dependent</b></u> on its predecessors (which are trained first).

- With <u><b>bagging</b></u> we <u><b>average</b></u> together <u><b>multiple</b></u> models that have been trained <u><b>independently</b></u> of each other. Indeed any one of those cross-validated models in a bagged ensemble can itself be a boosted model.
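
The contrast can be made concrete with a rough sketch in plain Python. Everything here is an illustrative assumption rather than the implementation behind this section's figures: the helper names `fit_stump`, `boost`, and `bag` are hypothetical, the data is a toy one-dimensional set, and for brevity each boosted model uses a fixed number of rounds instead of choosing it by validation error.

```python
import random
import statistics

def fit_stump(xs, residuals):
    """Fit a least-squares stump: one split point and a constant on each side."""
    best = None
    points = sorted(set(xs))
    for left, right in zip(points[:-1], points[1:]):
        split = 0.5 * (left + right)
        low = [r for x, r in zip(xs, residuals) if x <= split]
        high = [r for x, r in zip(xs, residuals) if x > split]
        low_value, high_value = statistics.mean(low), statistics.mean(high)
        cost = (sum((r - low_value) ** 2 for r in low)
                + sum((r - high_value) ** 2 for r in high))
        if best is None or cost < best[0]:
            best = (cost, split, low_value, high_value)
    _, split, low_value, high_value = best
    return lambda x: low_value if x <= split else high_value

def boost(xs, ys, rounds):
    """Boosting: units are added one at a time, each fit to the residual
    left by its predecessors, so every unit depends on the ones before it."""
    units = []
    model = lambda x: sum(unit(x) for unit in units)
    for _ in range(rounds):
        residuals = [y - model(x) for x, y in zip(xs, ys)]
        units.append(fit_stump(xs, residuals))
    return model

def bag(xs, ys, num_models, rounds, seed=0):
    """Bagging: each model is built independently on its own random 4/5
    portion of the data, and the ensemble output is their median."""
    rng = random.Random(seed)
    num_train = (4 * len(xs)) // 5
    models = []
    for _ in range(num_models):
        idx = rng.sample(range(len(xs)), num_train)
        models.append(boost([xs[i] for i in idx], [ys[i] for i in idx], rounds))
    return lambda x: statistics.median(m(x) for m in models)

# toy data, made up for illustration
xs = [i / 10 for i in range(20)]
ys = [x ** 2 for x in xs]

def training_error(model):
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys))

boosted_1 = boost(xs, ys, rounds=1)
boosted_5 = boost(xs, ys, rounds=5)
bagged = bag(xs, ys, num_models=5, rounds=5)
```

Note how `boost` loops sequentially over dependent units that are summed, while `bag` loops over independently built models that are combined by the median, with each bagged member here itself being a boosted model.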