## Chapter 11: Principles of Feature Learning

# 11.5 Efficient Cross-Validation via Boosting


With *boosting-based cross-validation* we perform our model search by taking a *single* high capacity model and optimizing it *one unit at-a-time*, resulting in a much more efficient, high resolution cross-validation procedure than the naive form of cross-validation.

In [1]:
## This code cell will not be shown in the HTML version of this notebook
# imports from custom library
import sys
sys.path.append('../../')
import autograd.numpy as np
from mlrefined_libraries import nonlinear_superlearn_library as nonlib
datapath = '../../mlrefined_datasets/nonlinear_superlearn_datasets/'

# plotting tools
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D

# this is needed to compensate for %matplotlib notebook's tendency to blow up images when plotted inline
%matplotlib notebook
from matplotlib import rcParams
rcParams['figure.autolayout'] = True

%load_ext autoreload
%autoreload 2

## The big picture

The basic principle behind boosting-based cross-validation is to progressively build a high capacity model <b><u>one unit at-a-time</u></b>, using units from a single type of universal approximator, as

\begin{equation}
\text{model}\left(\mathbf{x},\Theta\right) = w_0 + f_1\left(\mathbf{x}\right){w}_{1} +  f_2\left(\mathbf{x}\right){w}_{2} + \cdots + f_M\left(\mathbf{x}\right)w_M.
\label{equation:boosting-original-construct}
\end{equation}

We do this sequentially in $M$ 'rounds' of boosting where at each round we add one unit to the model, completely optimizing this unit's parameters alone along with its corresponding linear combination weight (we keep these parameters fixed at these optimally tuned values forever more).

Alternatively we can think of this procedure as beginning with a high capacity model of the form above and - in $M$ rounds - optimizing the parameters of each unit one unit at-a-time (a form of coordinate-wise optimization). 

In either case, performing boosting in this way produces a sequence of $M$ tuned models that generally increase in complexity with respect to the training dataset, which we denote compactly as $\left\{\text{model}_m\right\}_{m=1}^M$, where the $m^{th}$ model consists of $m$ tuned units (each having been tuned one at-a-time in the preceding rounds).

Since just one unit is optimized at-a-time, boosting tends to provide a computationally efficient high resolution form of model search.

The general boosting procedure tends to produce training/validation error curves like those shown in the top panel of [Figure 11.32](#figure-11-32). As with the naive approach detailed in the previous Section, here too we tend to see training error decrease as $m$ grows larger, while validation error tends to start high where underfitting occurs, dip down to a minimum value (perhaps oscillating more than the one time illustrated here), and rise back up when overfitting begins.

---

<img align="right" src= '../../mlrefined_images/nonlinear_superlearn_images/Figure_11_32.png' width="68%"  alt=""/>
<b>(top panel)</b> Prototypical training and validation error curves associated with a completed run of boosting.<br><br> <b>(bottom panels)</b> We fix the capacity dial all the way to the right and the optimization dial all the way to the left, slowly turning it from left to right, with each notch denoting the complete optimization of one additional unit of the model.

---

Using the capacity/optimization dial conceptualization first introduced in Section 11.2.4, we can think about boosting as starting with our *capacity dial* set all the way to the *right* at some high value (e.g., some large value of $M$), and fidgeting with the *optimization dial* by turning it very slowly from left to right starting at the far left (as depicted in the bottom row of Figure 11.15).

As discussed in Section 11.3.2 and summarized visually in the bottom row of Figure 11.25, with real data this general configuration (setting the capacity dial to the right, and adjusting the optimization dial) allows our *optimization dial* to (roughly speaking) govern the complexity of a tuned model based on how well we optimize: the more we optimize a high capacity model, the higher its complexity with respect to the training data becomes. In other words, with this configuration our optimization dial (roughly speaking) becomes the sort of fine resolution *complexity* dial we aimed to construct at the outset of the Chapter (see Section 11.1).

With our optimization dial turned all the way to the left we begin our search with a low complexity tuned model (called $\text{model}_1$) consisting of a single unit of a universal approximator having its parameters fully optimized. As we progress through rounds of boosting we turn the optimization dial gradually from left to right (here each notch on the optimization dial denotes the complete optimization of one additional unit), optimizing (to completion) a single weighted unit of our original high capacity model, so that at the $m^{th}$ round our tuned model (called $\text{model}_m$) consists of $m$ individually but fully tuned units.

Our ultimate aim in doing this is of course to determine a setting of the optimization dial, that is, an appropriate number of tuned units, that minimizes validation error. We visualize this concept in [Figure 11.32](#figure-11-32) by wrapping the prototypical validation error curve (shown in the top panel of the Figure) around the optimization dial (shown in its bottom right panel) from left to right, along with the generic markings denoting underfitting and overfitting. By using validation error we automate the process of setting our optimization dial to the proper setting, where validation error is minimal.

Whether we use fixed-shape, neural network, or tree-based units with boosting we will naturally prefer units with *low capacity* so that the resolution of our model search is as fine-grained as possible.

<b>(left panel)</b> Using <u>low-capacity units</u> makes the boosting procedure a high (or fine) resolution search for optimal model complexity. <b>(right panel)</b> Using <u>high-capacity units</u> makes boosting a low (or coarse) resolution search for optimal model complexity.<br>
<p>
  <img src= '../../mlrefined_images/nonlinear_superlearn_images/Figure_11_33.png' width="100%"  alt=""/>
</p>


---

This is depicted in the left panel of [Figure 11.33](#figure-11-33). If we used *high capacity* units at each round of boosting, the resulting model search would be much coarser: adding each additional unit would turn the dial aggressively from left to right, leaving large gaps in our model search, as depicted in the right panel of [Figure 11.33](#figure-11-33). This kind of low resolution search could easily result in us skipping over the complexity of an optimal model. The same can be said as to why we add only one unit at-a-time with boosting, tuning its parameters alone at each round. If we added more than one unit at-a-time, or if we re-tuned *every* parameter of *every* unit at each step of this process, not only would we have significantly more computation to perform at each step, but the performance difference between subsequent models could be quite large and we might easily miss out on an ideal model.

## Technical details

For boosting we need a set of $M$ nonlinear features or units from a single family of universal approximators

\begin{equation}
\mathcal{F} = \{f_{1}\left(\mathbf{x}\right),\,f_{2}\left(\mathbf{x}\right),\,\ldots,\,f_M\left(\mathbf{x}\right)\}.
\end{equation}

We add these units sequentially, one at-a-time, building a set of $M$ tuned models $\left\{\text{model}_m\right\}_{m=1}^M$ that increase in complexity with respect to the training data, from $m=1$ to $m=M$, ending with a generic nonlinear model composed of $M$ units.

We will express this final boosting-made model as

\begin{equation}
\text{model}\left(\mathbf{x},\Theta\right) = w_0 + f_{s_1}\left(\mathbf{x}\right){w}_{1} +  f_{s_2}\left(\mathbf{x}\right){w}_{2} + \cdots + f_{s_M}\left(\mathbf{x}\right)w_{M}.
\label{equation:final-boosting-model}
\end{equation}

Here we have re-indexed the individual units to $f_{s_m}$ (and the corresponding weight $w_{m}$) to denote the unit from the entire collection in $\mathcal{F}$ added at the $m^{th}$ round of the boosting
process. The linear combination weights $w_0$ through $w_M$ along with any additional
weights internal to $f_{s_1},\,f_{s_2},\,\ldots,\,f_{s_M}$ are
represented collectively in the weight set $\Theta$.
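The re-indexing by $s_m$ can be made concrete with a small sketch. Here `F` is a hypothetical list of callables standing in for the collection $\mathcal{F}$, `s` holds the indices $s_1,\ldots,s_M$ in the order the units were added, and `w` holds the matching weights; none of these names come from the accompanying library, and plain NumPy is assumed.

```python
import numpy as np

def model(x, w0, F, s, w):
    """Evaluate w0 + sum_m F[s_m](x) * w_m for the units in the order they were added."""
    out = np.full(np.shape(x), w0, dtype=float)
    for s_m, w_m in zip(s, w):
        out = out + F[s_m](x) * w_m
    return out

# Hypothetical collection of two fixed-shape units.
F = [lambda x: x, lambda x: x ** 2]

# Suppose round 1 picked the second unit and round 2 picked the first.
s = [1, 0]
w = [-1.0, 2.0]

x = np.array([0.0, 1.0, 2.0])
values = model(x, 0.5, F, s, w)   # 0.5 - x**2 + 2*x at each input
```

Nothing in this representation prevents the same index from appearing in `s` more than once.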

- The process of boosting is performed in a total of $M$ rounds.
- At each round we determine which unit, when added to the running model, best lowers its training error.
- We then measure the corresponding validation error provided by this update, and in the end after all rounds of boosting are complete, use the lowest validation error measurement found to decide which round provided the best overall model.

- For the sake of simplicity, we only discuss nonlinear regression on the training dataset $\left\{\left(\mathbf{x}_p,\,y_p\right)\right\}_{p=1}^P$ employing the Least Squares cost.

- However, the principles of boosting we will see remain *exactly* the same for other learning tasks (e.g., two-class and multi-class classification) and their associated costs.

### Round 0 of boosting

Round 0 starts with the model 

\begin{equation}
\text{model}_0^{\,}\left(\mathbf{x},\Theta_0\right) = w_0^{\,}
\end{equation}

whose weight set $\Theta_0^{\,} = \left\{ w_0\right\}$ contains a
single bias weight, whose optimal value $w_0^{\star}$ is found by minimizing the Least Squares cost

\begin{equation}
\frac{1}{P}\sum_{p=1}^{P}\left(\text{model}_0^{\,}\left(\mathbf{x}_p,\Theta_0 \right)  - \overset{\,}{y}_{p}^{\,}\right)^{2} =  \frac{1}{P}\sum_{p=1}^{P}\left(w_0^{\,}   - \overset{\,}{y}_{p}^{\,}\right)^{2}
\end{equation}

We fix the bias weight at the value of $w_0^{\star}$ forever more throughout the
process.
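For the Least Squares cost this Round 0 problem has a closed-form solution: setting the derivative of the cost with respect to $w_0$ to zero gives the mean of the training outputs. The short check below illustrates this on made-up numbers (the array `y` is purely illustrative).

```python
import numpy as np

# Hypothetical toy outputs, used only for illustration.
y = np.array([1.0, 2.0, 4.0, 5.0])

def round0_cost(w0, y):
    """Least Squares cost of the bias-only model, model_0(x) = w0."""
    return np.mean((w0 - y) ** 2)

# Setting the derivative (2/P) * sum_p (w0 - y_p) to zero gives the mean of the outputs.
w0_star = np.mean(y)

# Sanity check: nudging w0 away from the mean in either direction raises the cost.
assert round0_cost(w0_star, y) < round0_cost(w0_star + 0.1, y)
assert round0_cost(w0_star, y) < round0_cost(w0_star - 0.1, y)
```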

### Round 1 of boosting



Having tuned the only parameter of $\text{model}_0$ we now *boost* its complexity by adding the weighted unit
$f_{s_1}\left(\mathbf{x}\right)w_{1}$ to it:

\begin{equation}
\text{model}_1^{\,}\left(\mathbf{x},\Theta_1^{\,}\right) = \text{model}_0^{\,}\left(\mathbf{x},\Theta_0^{\,}\right) + f_{s_1}\left(\mathbf{x}\right)w_{1}^{\,}.
\end{equation}

To determine which unit in our set $\mathcal{F}$ best lowers the training error, we press
$\text{model}_1$ against the data by minimizing the following cost for every unit $f_{s_1} \in \mathcal{F}$

\begin{equation}
\begin{aligned}
 &\frac{1}{P}\sum_{p=1}^{P}\left(\text{model}_0^{\,}\left(\mathbf{x}_p,\Theta_0^{\,}\right) + f_{s_1}\left(\mathbf{x}_p^{\,}\right)w_1   - {y}_{p}^{\,}\right)^{2} \\
 =\; &\frac{1}{P}\sum_{p=1}^{P}\left(w_0^{\star} + f_{s_1}\left(\mathbf{x}_p\right)w_1  - {y}_{p}^{\,}\right)^{2}
\end{aligned}
\end{equation}

We keep the unit whose minimized cost is lowest and, as with the bias, fix its tuned weight $w_1^{\star}$ (along with any internal parameters) at these values for the remaining rounds.
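As a rough illustration of Round 1, consider the simplest case of fixed-shape units with no internal parameters. The cost is then a quadratic in $w_1$ alone, minimized at $w_1 = \sum_p f(\mathbf{x}_p)\,(y_p - w_0^{\star}) \,/\, \sum_p f(\mathbf{x}_p)^2$. The sketch below tries every unit in a small hypothetical collection of polynomial units and keeps the best one; the helper names and the toy data are assumptions made for this example.

```python
import numpy as np

def best_weight(f_vals, residual):
    """Closed-form minimizer over w of mean((f_vals * w - residual)**2)."""
    denom = np.dot(f_vals, f_vals)
    return np.dot(f_vals, residual) / denom if denom > 0 else 0.0

def select_unit(units, x, residual):
    """Try every candidate unit and return (index, weight, cost) of the best one."""
    best = None
    for s, f in enumerate(units):
        f_vals = f(x)
        w = best_weight(f_vals, residual)
        cost = np.mean((f_vals * w - residual) ** 2)
        if best is None or cost < best[2]:
            best = (s, w, cost)
    return best

# Hypothetical collection F of fixed-shape units: x, x**2, x**3.
units = [lambda x, d=d: x ** d for d in range(1, 4)]

# Toy data whose shape is purely quadratic.
x = np.linspace(-1.0, 1.0, 21)
y = 1.0 + 2.0 * x ** 2

w0_star = np.mean(y)                         # tuned in Round 0 and held fixed
s1, w1_star, cost1 = select_unit(units, x, y - w0_star)
```

On this toy dataset the quadratic unit (index `1`) is selected, since the odd-degree units cannot reduce the cost of an even-shaped target on a symmetric grid.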



### Round $m>1$ of boosting

We begin with $\text{model}_{m-1}$ consisting of a bias term and $m-1$ units of the form
    
\begin{equation}
\text{model}_{m-1}\left(\mathbf{x},\Theta_{m-1}\right) = w_0^{\star} + f_{s_1}^{\star}\left(\mathbf{x}\right){w}_{1}^{\star}  + \cdots + f_{s_{m-1}}^{\star}\left(\mathbf{x}\right)w_{m-1}^{\star}.
\label{equation:cv_boosting_model_m_minus_1}
\end{equation}



We then seek out the best weighted unit 
$f_{s_m}\left(\mathbf{x}\right)w_{m}$ to add to our running model

\begin{equation}
\text{model}_m^{\,}\left(\mathbf{x},\Theta_m^{\,}\right) = \text{model}_{m-1}^{\,}\left(\mathbf{x},\Theta_{m-1}^{\,}\right) + f_{s_m}\left(\mathbf{x}\right)w_{m}
\label{equation:boosting-round-m-model-version-1}
\end{equation}

by minimizing the following cost over $w_m$, $f_{s_m}$ and its internal parameters (if they exist)

\begin{equation}
\begin{aligned}
 &\frac{1}{P}\sum_{p=1}^{P}\left(\text{model}_{m-1}^{\,}\left(\mathbf{x}_p,\Theta_{m-1}^{\,}\right) + f_{s_m}\left(\mathbf{x}_p^{\,}\right)w_m   - {y}_{p}^{\,}\right)^{2} \\
 =\; &\frac{1}{P}\sum_{p=1}^{P}\left(w_0^{\star} + f^{\star}_{s_{1}}\left(\mathbf{x}_p\right)w_1^{\star}  + \cdots + f^{\star}_{s_{m-1}}\left(\mathbf{x}_p\right)w_{m-1}^{\star} + f_{s_m}\left(\mathbf{x}_p\right)w_m  - {y}_{p}^{\,}\right)^{2}.
\end{aligned}
\end{equation}



Note again the form of the cost function, which we must minimize over $w_m$ and $f_{s_m}$:

$$
\frac{1}{P}\sum_{p=1}^{P}\left(w_0^{\star} + f^{\star}_{s_{1}}\left(\mathbf{x}_p\right)w_1^{\star}  + \cdots + f^{\star}_{s_{m-1}}\left(\mathbf{x}_p\right)w_{m-1}^{\star} + f_{s_m}\left(\mathbf{x}_p\right)w_m  - {y}_{p}^{\,}\right)^{2}
$$

- If we use <b><u>fixed-shape</u></b> or <b><u>tree-based</u></b> approximators, this entails solving $M$ (or $M-m+1$, if we decide to check only those units not used in previous rounds) such optimization problems and choosing the one with smallest training error.

- If we use <b><u>neural networks</u></b>, since each unit takes the same form, we need only solve one such optimization problem.
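The rounds described above can be put together in a compact sketch for the fixed-shape case, where each round solves one small closed-form problem per candidate unit. This is an illustrative NumPy implementation under stated assumptions (units without internal parameters, Least Squares cost, a hypothetical `boost` helper and synthetic data), not the routine used by the accompanying library.

```python
import numpy as np

def boost(units, x_tr, y_tr, x_va, y_va, num_rounds):
    """Boost fixed-shape units one at a time under the Least Squares cost.

    Returns the bias, the (unit index, weight) pairs in the order they were
    added, and the training / validation error measured after each round.
    """
    w0 = np.mean(y_tr)                           # Round 0: bias only
    fit_tr = np.full(y_tr.shape, w0)
    fit_va = np.full(y_va.shape, w0)
    chosen, tr_err, va_err = [], [], []

    for _ in range(num_rounds):
        residual = y_tr - fit_tr
        best = None
        for s, f in enumerate(units):            # one small problem per candidate unit
            f_vals = f(x_tr)
            denom = np.dot(f_vals, f_vals)
            w = np.dot(f_vals, residual) / denom if denom > 0 else 0.0
            cost = np.mean((f_vals * w - residual) ** 2)
            if best is None or cost < best[2]:
                best = (s, w, cost)
        s, w, _ = best
        chosen.append((s, w))                    # earlier units are never re-tuned
        fit_tr = fit_tr + units[s](x_tr) * w
        fit_va = fit_va + units[s](x_va) * w
        tr_err.append(np.mean((fit_tr - y_tr) ** 2))
        va_err.append(np.mean((fit_va - y_va) ** 2))
    return w0, chosen, tr_err, va_err

# Synthetic data and a hypothetical collection of polynomial units.
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 60)
y = np.sin(3 * x) + 0.1 * rng.standard_normal(60)
units = [lambda x, d=d: x ** d for d in range(1, 9)]

w0, chosen, tr_err, va_err = boost(units, x[:40], y[:40], x[40:], y[40:], num_rounds=20)
best_round = int(np.argmin(va_err)) + 1          # round with the lowest validation error
```

Because a weight of zero is always available to each candidate, the training error recorded by this sketch cannot increase from one round to the next; the validation error carries no such guarantee, which is why it is used to pick `best_round`.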

## Early stopping

Once all rounds of boosting are complete we have generated a
sequence of $M$ tuned models, denoted $\left\{\text{model}_m\left(\mathbf{x},\Theta_m^{\,}\right)\right\}_{m=1}^M$,
which gradually increase in nonlinear complexity from $m = 1$ to
$m = M$, and thus gradually decrease in training error. This gives us fine-grained control in selecting an appropriate model, as the change in both training and validation
error between subsequent models in this sequence can be quite smooth, provided we use low capacity units (as discussed in Section 11.5.1).

- Once boosting is complete we select from our set of models the one that provides the lowest validation error.

- Alternatively, instead of running all rounds of boosting and deciding on an optimal model after the fact, we can attempt to *halt* the procedure when the validation error first starts to increase.

- This concept, referred to as *early stopping*, leads to a more computationally efficient implementation of boosting.

- However, one needs to be careful in deciding when the validation error has really reached its minimum as it can oscillate up and down multiple times.
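One common way to guard against halting on a mere oscillation is to wait for several consecutive rounds without improvement before stopping. The sketch below assumes such a rule; the `patience` parameter and the function name are illustrative choices rather than part of the procedure described above, and the list of validation errors stands in for values that would be measured round by round.

```python
def early_stopping_round(val_errors, patience=5):
    """Return the 1-based round with the lowest validation error seen before
    `patience` consecutive rounds pass without improvement."""
    best_round, best_err, waited = 0, float("inf"), 0
    for m, err in enumerate(val_errors, start=1):
        if err < best_err:
            best_round, best_err, waited = m, err, 0
        else:
            waited += 1
            if waited >= patience:
                break
    return best_round

# Hypothetical validation errors that dip, rise, and dip again.
val_errors = [1.0, 0.6, 0.7, 0.5, 0.55, 0.6, 0.7, 0.4]

stop_first_rise = early_stopping_round(val_errors, patience=1)    # halts at the first increase
stop_patient    = early_stopping_round(val_errors, patience=3)    # rides out one oscillation
stop_full       = early_stopping_round(val_errors, patience=10)   # effectively runs every round
```

A small `patience` saves the most computation but is the most likely to stop on a temporary rise; a larger value trades extra rounds for a better chance of reaching a later, lower minimum.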

## An inexpensive but effective enhancement

- At the $m^{th}$ round of boosting we can add an additional bias weight $w_{0, m}$ as

\begin{equation}
\text{model}_m^{\,}\left(\mathbf{x},\Theta_m^{\,}\right) = \text{model}_{m-1}^{\,}\left(\mathbf{x},\Theta_{m-1}^{\,}\right) + w_{0, m} + f_{s_m}\left(\mathbf{x}\right)w_{m}.
\label{equation:boosting-round-m-model-version-2}
\end{equation}

- This simple adjustment results in greater flexibility and generally better overall performance by allowing units to be adjusted 'vertically' at each round.

- Note that once tuning is done, the optimal bias weight $w^{\star}_{0, m}$ can be absorbed into the bias weights from previous
rounds.

- This enhancement is particularly useful when using fixed-shape or neural network units for boosting, and is redundant
when using tree-based approximators.
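For a fixed-shape unit, tuning the extra bias together with the unit's weight is a two-parameter Least Squares problem on the current residual. The sketch below solves it with `np.linalg.lstsq` and then absorbs the tuned bias into the running bias; the helper name, the toy residual, and the previous bias value are all assumptions made for illustration.

```python
import numpy as np

def fit_unit_with_bias(f_vals, residual):
    """Least Squares fit of b + f_vals * w to the residual; returns (b, w)."""
    A = np.column_stack([np.ones_like(f_vals), f_vals])
    (b, w), *_ = np.linalg.lstsq(A, residual, rcond=None)
    return b, w

x = np.linspace(-1.0, 1.0, 11)
f_vals = x ** 2                        # hypothetical fixed-shape unit
residual = 3.0 - 2.0 * f_vals          # toy residual that needs a vertical shift

b_m, w_m = fit_unit_with_bias(f_vals, residual)

w0_prev = 0.25                         # hypothetical bias accumulated in earlier rounds
w0_total = w0_prev + b_m               # the new bias is absorbed into the running bias
```

Without the bias term the unit $x^2$ alone could not match this residual, since scaling it never shifts its value at $x = 0$ away from zero.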

Tree-based approximators make the enhancement redundant because they already have individual bias terms baked into them that always allow for this kind of vertical adjustment at each round of boosting (in the jargon of machine learning, boosting with tree-based learners is often referred to as *gradient boosting*; see Section 14.5). To see this, note that while the most common way of expressing a stump taking in $N=1$ dimensional input is

\begin{equation}
f\left(x\right)=\begin{cases}
\begin{array}{c}
v_{1}\\
v_{2}
\end{array} & \begin{array}{c}
x<s \\
x>s
\end{array}\end{cases}
\end{equation}

it is also possible to express $f(x)$ equivalently as

\begin{equation}
f(x) = b + w\,h(x)
\label{equation:stump-chapter-14-bias-feature-weight-version}
\end{equation}

where $b$ denotes an individual bias parameter for the stump and $w$ is an associated weight that scales $h(x)$, which is a simple step function with fixed levels and a split at $x=s$

\begin{equation}
h\left(x\right)=\begin{cases}
\begin{array}{c}
0 \\
1
\end{array} & \begin{array}{c}
x<s\\
x>s
\end{array}\end{cases}
\end{equation}

Expressing the stump in this equivalent manner allows us to see that every stump unit does indeed have its own individual bias parameter, making it redundant to add an individual bias at each round when boosting with stumps (and the same holds for stumps taking in general $N$ dimensional input as well).
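The equivalence is easy to verify numerically: matching the two forms on each side of the split gives $b = v_1$ and $w = v_2 - v_1$. The short check below uses made-up values for the split and the two levels.

```python
import numpy as np

def stump(x, s, v1, v2):
    """Stump with level v1 to the left of the split s and v2 to the right."""
    return np.where(x < s, v1, v2)

def step(x, s):
    """Fixed-level step function h: 0 to the left of the split, 1 to the right."""
    return np.where(x < s, 0.0, 1.0)

x = np.array([-2.0, -0.5, 0.3, 1.7])    # hypothetical inputs, none equal to the split
s, v1, v2 = 0.0, 1.5, -0.5

b, w = v1, v2 - v1                       # bias and weight that reproduce the stump
assert np.allclose(stump(x, s, v1, v2), b + w * step(x, s))
```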

---

#### <span style="color:#a50e3e;">Example.</span>  Boosting regression using stump units

- In the following animation we illustrate the result of $M = 100$ rounds of boosting using a set of $B = 20$ stumps (many of the stumps are used multiple times) on a toy regression dataset.

- Data is split into $\frac{2}{3}$ training (blue) and $\frac{1}{3}$ validation (yellow).

In [1]:
## This code cell will not be shown in the HTML version of this notebook
# import data
csvname = datapath + 'noisy_sin_sample.csv'
data = np.loadtxt(csvname,delimiter = ',')
x = data[:-1,:]
y = data[-1:,:] 

# import booster
mylib = nonlib.intro_boost_library.stump_booster.Setup(x,y)

# choose normalizer
mylib.choose_normalizer(name = 'standard')

# pick training set
mylib.make_train_valid_split(train_portion=0.66)

# choose cost
mylib.choose_cost(name = 'least_squares')

# choose optimizer
mylib.choose_optimizer('newtons_method',max_its=1)

# run boosting
mylib.boost(51)

# produce animation
csvname = datapath + 'noisy_sin_sample.csv'
frames = 50
anim = nonlib.boosting_regression_animators.Visualizer(csvname)
savepath = 'videos/animation_8.mp4'
anim.animate_trainval_boosting(savepath,mylib,frames,fps=5)

# load in video and display
from IPython.display import HTML
HTML("""
<video width="800" height="400" controls loop>
  <source src="videos/animation_8.mp4" type="video/mp4">
  </video>
""")

## 11.5.5 Similarity to feature selection

The careful reader will notice how similar the boosting procedure is to the one introduced in [Section 9.6](https://jermwatt.github.io/machine_learning_refined/notes/9_Feature_engineer_select/9_6_Boosting.html) in the context of feature selection. Indeed, in principle the two approaches are essentially the same, except that with boosting we do not select from a set of given input features but create them ourselves based on a chosen universal approximator family. Additionally, unlike feature selection where our main concern is *human interpretability*, we primarily use boosting as a tool for cross-validation. This means that unless we specifically prohibit it from occurring, we can indeed select the same feature multiple times in the boosting process as long as it contributes positively towards finding a model with minimal validation error.

These two use-cases for boosting, i.e., feature selection and
cross-validation, can occur together, albeit typically in the context
of linear modeling as detailed in [Section 9.6](https://jermwatt.github.io/machine_learning_refined/notes/9_Feature_engineer_select/9_6_Boosting.html). Often in such instances
cross-validation is used with a linear model as a way of automatically
selecting an appropriate number of features, with human interpretation of the resulting selected features still in mind. On the other hand, feature selection is rarely done when employing a nonlinear model based on features from a universal approximator, because nonlinear features are very difficult for humans to interpret. The rare exception to this rule is when using tree-based units which, due to their simple structure, can in particular instances be readily interpreted by humans.

## The residual perspective with regression

Consider the following Least Squares cost function where we have inserted a boosted
model at the $m^{th}$ round of its development

\begin{equation}
g\left(\Theta_m^{\,}\right) = \frac{1}{P}\sum_{p=1}^{P}\left(\text{model}_m^{\,}\left(\mathbf{x}_p,\Theta_m^{\,}\right) - y_p\right)^2.
\label{equation:boosting-LS-cost}
\end{equation}

We can write our boosted model recursively as

\begin{equation}
\text{model}_m^{\,}\left(\mathbf{x}_p^{\,},\Theta_m^{\,}\right) = \text{model}_{m-1}^{\,}\left(\mathbf{x}_p^{\,},\Theta_{m-1}^{\,}\right) + f_m^{\,}\left(\mathbf{x}_p\right)w_m^{\,}
\label{equation:boosting-LS-recursive-model}
\end{equation}

where all of the parameters of the $\left(m-1\right)^{th}$ model (i.e., $\text{model}_{m-1}$) are already tuned.

Combining the two equations above we can re-write the Least Squares cost as

\begin{equation}
g\left(\Theta_m^{\,}\right) = \frac{1}{P}\sum_{p=1}^{P}\left(f_m^{\,}\left(\mathbf{x}_p^{\,}\right)w_m^{\,} - \left(y_p^{\,} - \text{model}_{m-1}^{\,}\left(\mathbf{x}_p^{\,}\right)\right)\right)^2.
\end{equation}

By minimizing this cost we look to tune the parameters of a
single additional unit so that for all $p$ we have

\begin{equation}
f_m^{\,}\left(\mathbf{x}_p\right)w_m^{\,}\approx y_p^{\,} - \text{model}_{m-1}^{\,}\left(\mathbf{x}_p^{\,}\right)
\end{equation}

The right-hand side, the difference between our original output and the contribution of the $\left(m-1\right)^{th}$ model, is often called the *residual*: it is what is left to represent after subtracting off what was learned by the $\left(m-1\right)^{th}$ model.
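The re-arrangement above is purely algebraic, so the two forms of the cost agree for any candidate unit and weight. The check below confirms this numerically; the stand-ins for $\text{model}_{m-1}$ and $f_m$, the weight value, and the synthetic data are arbitrary choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 30)
y = np.sin(3 * x)

prev_fit = np.mean(y) + 0.8 * x      # stand-in for the tuned model_{m-1} at each x_p
f_vals = x ** 3                      # stand-in for the new unit f_m at each x_p
w = 0.37                             # any candidate weight w_m

# Cost written in terms of the full model_m versus the original output.
cost_direct = np.mean((prev_fit + f_vals * w - y) ** 2)

# Same cost written as a fit of the single weighted unit to the residual.
residual = y - prev_fit
cost_residual = np.mean((f_vals * w - residual) ** 2)

assert np.isclose(cost_direct, cost_residual)
```

In practice this means each round can be run as an ordinary regression of one unit onto the current residual, without touching the parameters tuned in earlier rounds.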

---

#### <span style="color:#a50e3e;">Example.</span>  Boosting regression and the 'fitting to the residual' perspective

In the animation below we illustrate the process of boosting $M = 5000$ single-layer $\texttt{tanh}$ units to a toy regression dataset.

- **left panel** shows the dataset along with the fit provided by $\text{model}_m$ at the $m^{th}$ step of boosting for select values of $m$.
- **right panel** shows the *residual* at the same step, as well as the fit provided by the corresponding $m^{th}$ unit $f_m$.

In [4]:
## This code cell will not be shown in the HTML version of this notebook
import copy
# load in dataset
csvname = datapath + 'noisy_sin_sample.csv'
data = np.loadtxt(csvname,delimiter = ',')
x = copy.deepcopy(data[:-1,:])
y = copy.deepcopy(data[-1:,:] )

# boosting procedure
num_units = 15
runs2 = []
for j in range(num_units):    
    # import the v1 library
    mylib2 = nonlib.intro_boost_library.net_booster.Setup(x,y)
    
    # choose normalizer
    mylib2.choose_normalizer(name = 'standard')

    # use the entire dataset for training (no validation split)
    mylib2.make_train_valid_split(train_portion = 1)

    # choose cost
    mylib2.choose_cost(name = 'least_squares')

    # choose optimizer
    mylib2.choose_optimizer('gradient_descent',max_its=10000,alpha_choice = 10**(-1))
    
    # choose activation 
    mylib2.choose_activation(activation = 'relu')
    
    # run boosting
    mylib2.boost(1,verbose=False)
    mylib2.model = mylib2.models[-1]

    # add model to list
    runs2.append(copy.deepcopy(mylib2))
    
    # subtract this unit's fit from the output so the next unit is fit to the residual
    normalizer = mylib2.normalizer
    ind = np.argmin(mylib2.train_cost_vals[0])
    y_pred =  mylib2.models[-1](mylib2.normalizer(x))
    y -= y_pred

# animate the business
frames = num_units
demo2 = nonlib.boosting_regression_animators_v3.Visualizer(csvname)
savepath='videos/animation_9.mp4'
demo2.animate_boosting(savepath,runs2,frames,fps=2)

# load in video and display
from IPython.display import HTML
HTML("""
<video width="800" height="400" controls loop>
  <source src="videos/animation_9.mp4" type="video/mp4">
  </video>
""")

---