## Chapter 11: Principles of Feature Learning

# 11.6 Efficient Cross-Validation via Regularization


In the previous Section we saw how boosting based cross-validation automatically learns the proper level of model complexity for a given dataset by optimizing a general high capacity model one unit at a time.  In this Section we introduce what are collectively referred to as *regularization* techniques for efficient cross-validation.  With this set of approaches we once again start with a single high capacity model, and once again adjust its complexity with respect to a training dataset via careful optimization.  However, with regularization we tune all of the units *simultaneously*, controlling how well we *optimize* the associated cost so that a validation error minimizing instance of the model is achieved.

In [1]:
## This code cell will not be shown in the HTML version of this notebook
# imports from custom library
import sys
sys.path.append('../../')
import autograd.numpy as np
from mlrefined_libraries import nonlinear_superlearn_library as nonlib
datapath = '../../mlrefined_datasets/nonlinear_superlearn_datasets/'

# plotting tools
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D

# this is needed to compensate for %matplotlib notebook's tendency to blow up images when plotted inline
%matplotlib notebook
from matplotlib import rcParams
rcParams['figure.autolayout'] = True

%load_ext autoreload
%autoreload 2

## 11.6.1 The big picture

Imagine for a moment that we have a simple nonlinear regression dataset, like the one shown in the top-left panel of [Figure 11.37](#figure-11-37), and we use a high capacity model (relative to the nature of the data) made up of a sum of *universal approximators* of a single kind to fit it as

\begin{equation}
\text{model}\left(\mathbf{x},\Theta\right) = w_0 + f_1\left(\mathbf{x}\right){w}_{1} +  f_2\left(\mathbf{x}\right){w}_{2} + \cdots + f_M\left(\mathbf{x}\right)w_M.
\label{equation:regularization-original-construct}
\end{equation}

Suppose then that we partition this data into training and validation portions, and then train our high capacity model by *completely* optimizing the Least Squares cost over the training portion of the data.  In other words, we determine a set of parameters for our high capacity model that lie very close to a global minimum of its associated cost function.  In the top-right panel of the Figure we draw a hypothetical two dimensional illustration of the cost function associated with our high capacity model over the training data, denoting the global minimum by a blue dot and its evaluation by the function by a blue 'x'.

Since our model has high capacity, the parameters lying at the global minimum of our cost produce a tuned model that is overly complex and *severely* overfits the training portion of our dataset.  In the bottom-left panel of [Figure 11.37](#figure-11-37) we show the tuned model fit (in blue) provided by such a set of parameters, which wildly overfits the training data.  In the top-right panel we also show a set of parameters lying relatively near the global minimum as a yellow dot, whose evaluation by the function is shown as a yellow 'x'.  This set of parameters, lying in the general neighborhood of the global minimum, is where the cost function is minimized over the *validation* portion of our data.  Because of this the corresponding fit (shown in the bottom-right panel in yellow) provides a much better representation of the data.
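The gap between training and validation error for a completely optimized high capacity model can be reproduced in a few lines. The sketch below is illustrative only: it assumes a small synthetic noisy sinusoid (not the dataset pictured in the Figure), plain `numpy` rather than the notebook's custom library, and polynomial units as the universal approximator. The helpers `poly_features` and `least_squares_error` are hypothetical names introduced for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# hypothetical noisy sinusoidal data (an assumption, not the Section's dataset)
x = np.linspace(0, 1, 12)
y = np.sin(2 * np.pi * x) + 0.2 * rng.standard_normal(x.size)

# hold out every third point for validation (4 points), train on the other 8
valid_mask = np.arange(x.size) % 3 == 1
x_train, y_train = x[~valid_mask], y[~valid_mask]
x_valid, y_valid = x[valid_mask], y[valid_mask]

def poly_features(x, degree):
    # columns are 1, x, x^2, ..., x^degree
    return np.vander(x, degree + 1, increasing=True)

def least_squares_error(w, x, y, degree):
    # mean squared error of the polynomial model with weights w
    return np.mean((poly_features(x, degree) @ w - y) ** 2)

# degree 7 gives 8 weights for 8 training points: high capacity for this data
degree = 7

# completely minimize the Least Squares cost over the training portion
w_star, *_ = np.linalg.lstsq(poly_features(x_train, degree), y_train, rcond=None)

train_error = least_squares_error(w_star, x_train, y_train, degree)
valid_error = least_squares_error(w_star, x_valid, y_valid, degree)
print(train_error, valid_error)
```

With as many weights as training points, the Least Squares cost can be driven essentially to zero on the training portion, while the error on the held-out points stays strictly larger. This is the overfitting behavior described above.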

---

<a id='figure-11-37'></a>
<figure>
<p>
  <img src= '../../mlrefined_images/nonlinear_superlearn_images/Figure_11_37.png' width="70%"  alt=""/>
</p>
<figcaption> <strong>Figure: 11.37 </strong> <em> 
(top-left panel) A generic nonlinear regression dataset. (top-right panel) A figurative illustration of the cost function associated with a high capacity model over the training portion of this data.  The global minimum is marked here with a blue dot (along with its evaluation by a blue 'x') and a point nearby is marked in yellow (with its evaluation shown as a yellow 'x').  (bottom-left panel) The original data and the fit (in blue) provided by the model using parameters from the global minimum of the cost function, which severely overfits the training portion of the data.  (bottom-right panel) The parameters corresponding to the yellow dot shown in the top-right panel minimize the cost function over the validation portion of the data, and here provide a much better fit (in yellow) to the data.  See text for further details.
</em>
</figcaption>
</figure>

---

#### Overfitting occurs when:
- the *capacity* of a machine learning model is too high
- *and*, its corresponding cost function (over the training data) is *optimized* too well.


#### Regularization - a potential solution:
- we set the model parameters purposefully away from the global minimum of its associated cost function, so as to find where the validation error (<u>not</u> training error) is at its lowest. 

### Two most popular approaches to regularization:
- Early stopping
- Adding a regularizer to the cost function


<p>
  <img src= '../../mlrefined_images/nonlinear_superlearn_images/Figure_11_38.png' width="70%"  alt=""/>
</p>

**(left panel)** A figurative illustration of early stopping regularization: we stop the optimization prematurely at the yellow point (where validation error is minimal).<br>
**(right panel)** A figurative illustration of regularizer based regularization: by adding a regularizer function to the cost associated with a high capacity model we change its shape, in particular dragging its global minimum (where overfitting behavior occurs) away from its original location.

## 11.6.2 Early stopping based regularization

With *early stopping regularization* we properly tune a high capacity model by making a run of local optimization (tuning all parameters of the model), and by using the set of weights from this run where the model achieves minimum validation error.  This idea is illustrated in the left panel of [Figure 11.38](#figure-11-38) where we employ the same prototypical cost function (associated with a high capacity model) first shown in the top-right panel of [Figure 11.37](#figure-11-37).

Here again we mark its global minimum and set of validation error minimizing weights in blue and yellow, respectively (as detailed originally in [Figure 11.37](#figure-11-37)). During a run of local optimization we frequently compute training and validation errors (e.g., at each step of the optimization procedure). Thus, depending on the optimization procedure used (as detailed further below) a set of weights providing minimum validation error for a high capacity model can be determined with fine resolution.

This regularization approach is especially popular when employing high capacity deep neural network models as detailed in Section 13.7.

---

Whether one literally stops the optimization run when minimum validation error has been reached (which can be challenging in practice given the somewhat unpredictable behavior of validation error, as first noted in Section 11.4) or runs the optimization to completion (picking the best set of weights afterwards), in either case we refer to this method as early stopping regularization.  Note that the method itself is analogous to the early stopping procedure outlined for boosting based cross-validation in Section 11.5, in that we sequentially increase the complexity of a model until minimum validation error is reached.  However, here (unlike boosting) we do this by controlling how well we optimize all of a model's parameters *simultaneously*, as opposed to one unit at a time.



Supposing that we begin our optimization with a small initial value (which we typically do; see, for example, Section 3.6) the corresponding training and validation error curves will in general look like those shown in the top panel of [Figure 11.39](#figure-11-39) (note that both can oscillate in practice, depending on the optimization method used). At the start of the run the complexity of our model (evaluated at, for instance, the initial weights) is quite small, providing a large training and validation error.  As minimization proceeds, and we continue optimizing one step at a time, error in both training and validation portions of the data decreases while the complexity of the tuned model increases.  This trend continues up until a point when the model complexity becomes too great and overfitting begins, and validation error increases.

---

<figure>
<p>
  <img align="right" src= '../../mlrefined_images/nonlinear_superlearn_images/Figure_11_39.png' width="60%"  alt=""/>
</p>
    
**(top panel)** A prototypical pair of training/validation error curves associated with a generic run of the early stopping regularization.<br><br>
**(bottom panels)** We set our capacity dial all the way to the right and our optimization dial all the way to the left, slowly moving it from left to right in search of a model with minimum validation error. Here each notch on the optimization dial abstractly denotes a step of local optimization.    
</figure>

---

In terms of the capacity/optimization dial scheme detailed in the context of real data in Section 11.3.2, we can think of (early stopping based) regularization as beginning with our capacity dial set all the way to the right (since we employ a high capacity model) and our optimization dial all the way to the left (at the initialization of our optimization).  With this configuration - summarized visually in the bottom panel of [Figure 11.39](#figure-11-39) - we allow our optimization dial to (roughly speaking) directly govern the amount of complexity our tuned models can take (here each notch on the optimization dial denotes a single step of local optimization).  In other words, with this configuration our optimization dial becomes (roughly speaking) the ideal complexity dial described at the start of the Chapter in Section 11.1.  With early stopping we turn our optimization dial from left to right, starting at our initialization making a run of local optimization one step at-a-time, seeking out a set of parameters that provide minimum validation error for our (high capacity) model.  This is illustrated in the bottom panels of Figure 11.39, where we see our capacity dial set all the way to the right and our generic validation error curve wrapped around our optimization dial (as it now, roughly speaking, controls the complexity of each tuned model).

With an initial set of parameters $\Theta_0$, taking $M$ steps of a local optimization produces a sequence of $M+1$ parameter settings $\left\{\Theta_m\right\}_{m=0}^M$ for our model, or similarly (ignoring the initialization for the sake of illustration) a set of $M$ models of generally *increasing* complexity with respect to the training data $\left\{\text{model}\left(\mathbf{x},\Theta_m\right)\right\}_{m=1}^M$.
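The procedure just described can be sketched as follows. This is a minimal illustration under stated assumptions, not the notebook's own implementation: it uses plain `numpy`, standard gradient descent as the local optimizer, a Least Squares cost on a linear combination of precomputed features, and synthetic data. The names `early_stopping_run`, `least_squares`, and `least_squares_grad` are hypothetical.

```python
import numpy as np

def least_squares(w, F, y):
    # mean squared error of the linear-in-weights model F @ w
    return np.mean((F @ w - y) ** 2)

def least_squares_grad(w, F, y):
    # gradient of the Least Squares cost above with respect to w
    return 2.0 * F.T @ (F @ w - y) / y.size

def early_stopping_run(F_train, y_train, F_valid, y_valid, w0, alpha, max_its):
    # take max_its gradient descent steps, keeping every set of weights
    w = np.array(w0, dtype=float)
    weight_history = [w.copy()]
    for _ in range(max_its):
        w = w - alpha * least_squares_grad(w, F_train, y_train)
        weight_history.append(w.copy())

    # measure training and validation error at every step of the run
    train_errors = np.array([least_squares(v, F_train, y_train) for v in weight_history])
    valid_errors = np.array([least_squares(v, F_valid, y_valid) for v in weight_history])

    # early stopping: keep the weights with the lowest validation error
    best_step = int(np.argmin(valid_errors))
    return weight_history[best_step], best_step, train_errors, valid_errors

# hypothetical synthetic data with degree-10 polynomial features
rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, 30)
y = np.sin(3 * x) + 0.3 * rng.standard_normal(30)
F = np.vander(x, 11, increasing=True)
F_train, y_train, F_valid, y_valid = F[:20], y[:20], F[20:], y[20:]

# conservative steplength 1/L, with L the Lipschitz constant of the gradient
L = 2.0 * np.linalg.eigvalsh(F_train.T @ F_train).max() / y_train.size
best_w, best_step, train_errors, valid_errors = early_stopping_run(
    F_train, y_train, F_valid, y_valid, w0=np.zeros(11), alpha=1.0 / L, max_its=2000)
```

Here the run is taken to completion and the best weights are selected afterwards, which is one of the two variants of early stopping mentioned above. With this conservative steplength the training error never increases from one step to the next, while the validation error is free to rise once overfitting begins.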

---

<a id='figure-11-40'></a>
<figure>
<p>
  <img src= '../../mlrefined_images/nonlinear_superlearn_images/Figure_11_40_bottom.png' width="90%"  alt=""/>
</p>
<figcaption> <strong>Figure: 11.40 </strong> <em> 
Two subtleties associated with early stopping based regularization.  (top row) (left panel) A prototypical cost function associated with a high capacity model, with two optimization paths (shown in red and green, respectively) resulting from two local optimization runs beginning at different starting points.  (middle panel) The validation error histories corresponding to each optimization run.  (right panel) While each run produces a different set of optimal weights, and a different fit to the data (here shown in green and red respectively, corresponding to each run), these fits are generally equally representative.  (bottom row) (left panel) Taking optimization steps with a small steplength makes the early stopping procedure a fine-resolution search for optimal model complexity.  With such small steps we smoothly turn the optimization dial from left to right in search of a validation error minimizing tuned model.  (right panel)  Using steps with a large steplength makes early stopping a coarse resolution search for optimal model complexity.  With each step taken we aggressively turn the dial from left to right, performing a coarser resolution model search that potentially skips over the optimal model.
</em>
</figcaption>
</figure>

---

There are a number of important engineering details associated with implementing an effective early stopping regularization procedure, including the following:

### Is the model found by early stopping unique?

Different initializations can produce different trajectories towards potentially different minima of the cost function, and produce corresponding validation error minimizing models that differ in shape.

<br>
<p>
  <img src= '../../mlrefined_images/nonlinear_superlearn_images/Figure_11_40_top.png' width="90%"  alt=""/>
</p>

### How high should capacity be set?

While there is no clear-cut answer to this question, the capacity must simply be set at least 'high' enough that the model overfits if optimized completely.

### Local optimization must be carefully performed

One must be careful with the sort of local optimization scheme used with early stopping cross-validation.  As illustrated in the bottom row of [Figure 11.40](#figure-11-40), ideally we want to turn our optimization dial smoothly from left to right, searching over a set of model complexities with a fine resolution (depicted visually in the bottom-left panel of the Figure).  This means - for example - that with early stopping we often avoid local optimization schemes that take very large steps (e.g., Newton's method - as detailed in Chapter 2), as this can result in a coarse, low resolution search over model complexity that can easily skip over minimum validation models, as depicted in the bottom-right panel of the Figure.  Local optimizers that take smaller, high quality steps - like the advanced first order methods detailed in Chapter 3 - are often preferred when employing early stopping.  Moreover, when employing mini-batch/stochastic first order methods (see Section 3.11) validation error should be measured *several times per epoch* to avoid taking too many steps without measuring validation error.

### When is validation error really at its lowest?

- While generally speaking validation error decreases at the start of an optimization run and eventually increases (making somewhat of a 'U' shape) it can certainly fluctuate up and down during optimization.

- Often in practice a reasonable engineering choice is made as to when to stop, based on how many steps have passed since the validation error last decreased.

- Moreover, one need not truly halt a local optimization procedure to employ the thrust of early stopping, and can simply run the optimizer to completion and select the best set of weights from the run (that minimize validation error) after completion.  
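One common way to make the stopping choice above concrete is a so-called patience rule: halt once a fixed number of steps pass without a new lowest validation error. The sketch below is an assumption-laden illustration rather than a prescribed method; the function name `patience_stop`, the `patience` parameter, and the toy error history are all hypothetical.

```python
def patience_stop(valid_errors, patience):
    """Scan a validation error history in order and return (stop_step, best_step).

    The scan halts at the first step where `patience` steps have passed
    without a new lowest validation error; if that never happens, it runs
    to the final step of the history.
    """
    best_error, best_step = float("inf"), 0
    for step, err in enumerate(valid_errors):
        if err < best_error:
            # new lowest validation error: remember where it occurred
            best_error, best_step = err, step
        elif step - best_step >= patience:
            # no improvement for `patience` steps: stop here
            return step, best_step
    return len(valid_errors) - 1, best_step

# a toy history that fluctuates, then dips again late in the run
history = [5.0, 4.0, 3.0, 3.5, 2.9, 3.1, 3.2, 3.3, 3.4, 1.0]

short = patience_stop(history, patience=3)   # stops at step 7, best seen at step 4
long = patience_stop(history, patience=10)   # runs to the end, best at step 9
```

The toy history shows the trade-off involved: a short patience halts the run sooner but can miss a later, lower dip in validation error, whereas running to completion and selecting the best weights afterwards cannot.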

---

## 11.6.3 Regularizer based methods

A <b><u>regularizer</u></b> is a simple function that can be added to a machine learning cost for a variety of purposes:
- to prevent unstable learning
- as a natural part of relaxing the support vector machine and multi-class learning scenarios
- for feature selection
- and, for regularization (which we discuss here)

Denoting the original cost function by $g$, and the associated regularizer by $h$, the regularized cost is given as the linear combination of $g$ and $h$ as 

\begin{equation}
    g\left(\Theta \right) + \lambda\, h\left(\Theta\right)
    \label{equation:cross-validation-general-regularized-cost}
\end{equation}

where $\lambda$ is referred to as the *regularization parameter*.


The regularization parameter 
- is always non-negative $\lambda \geq 0$ 
- controls the mixture of the cost and regularizer 
- when set very small, the regularized cost is essentially just $g$
- when set very large the regularizer $h$ dominates in the linear combination and drowns out $g$
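The effect of $\lambda$ on the minimizer of the regularized cost can be seen on a one dimensional toy problem. The sketch below is purely illustrative: the cost $g(w) = (w-3)^2$, the regularizer $h(w) = w^2$, the helper name `regularized_cost`, and the grid search used to locate each minimum are all assumptions made for this example. Setting the derivative of $g + \lambda h$ to zero gives the minimizer $w = 3/(1+\lambda)$ for this toy pair.

```python
import numpy as np

def regularized_cost(g, h, lam):
    # returns the function theta -> g(theta) + lam * h(theta)
    if lam < 0:
        raise ValueError("the regularization parameter must be non-negative")
    return lambda theta: g(theta) + lam * h(theta)

g = lambda w: (w - 3.0) ** 2   # toy cost, minimized at w = 3
h = lambda w: w ** 2           # squared l2 regularizer, minimized at w = 0

# locate the minimizer of each regularized cost on a fine grid
w_grid = np.linspace(-1.0, 4.0, 5001)
minimizers = {}
for lam in [0.0, 1.0, 1000.0]:
    cost = regularized_cost(g, h, lam)
    minimizers[lam] = w_grid[np.argmin(cost(w_grid))]
```

With $\lambda = 0$ the minimizer sits at that of $g$ (near $w = 3$), with $\lambda = 1$ it is pulled halfway to $w = 1.5$, and with a very large $\lambda$ it lies essentially at the minimizer of $h$ (near $w = 0$).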

---


  <img align="right" src= '../../mlrefined_images/nonlinear_superlearn_images/11_41.png' width="55%"  alt=""/>
<b>(top panel)</b> A prototypical pair of training/validation error curves associated with a generic run of regularizer based cross-validation. <br><br>

<b>(bottom panels)</b> Initially, the capacity dial is set all the way to the right and the optimization dial all the way to the left. We then slowly move our optimization dial from left to right by decreasing the value of $\lambda$, gradually increasing the complexity of our tuned model, in search of a tuned model with minimum validation error.

---

Supposing that we begin with a large value of $\lambda$ and try progressively smaller values (completely optimizing each regularized cost), the corresponding training and validation error curves will in general look something like those shown in the top panel of [Figure 11.41](#figure-11-41) (remember that in practice *validation error* can oscillate, and need not take just one dip down). At the start of this procedure, using a large value of $\lambda$, the complexity of our model is quite small because the regularizer completely dominates the regularized cost, and thus the minimum recovered belongs to the regularizer and not to the cost function itself.  Since this set of weights is virtually unrelated to the data we are training over, the corresponding model will tend to have large training and validation errors.  As $\lambda$ is decreased the parameters provided by complete minimization of the regularized cost move closer to the global minimum of the original cost itself, and so error on both training and validation portions of the data decreases while (generally speaking) the complexity of the tuned model increases.  This trend continues up until a point when the regularization parameter is small enough that the recovered parameters lie too close to the global minimum of the original cost, so that the corresponding model complexity becomes too great.  Here overfitting begins and validation error increases.

In terms of the capacity/optimization dial scheme detailed in the context of real data in Section 11.3.2, we can think of (regularizer based) regularization as beginning with our capacity dial set all the way to the *right* (since we employ a high capacity model) and our optimization dial all the way to the *left* (employing a large value for $\lambda$ in our regularized cost).  Here each notch on the optimization dial represents the complete minimization of the regularized cost function for a given value of $\lambda$ - thus when the dial is turned all the way to the right (where $\lambda = 0$) we completely minimize the original cost.  With this configuration (summarized visually in the bottom panel of Figure 11.41) we allow our optimization dial to (roughly speaking) directly govern the amount of complexity our tuned models can take (here each setting of the capacity dial defines a model and each setting of the optimization dial a set of parameters of that model).  As we turn our optimization dial from left to right we *decrease* the value of $\lambda$ and *completely minimize* the corresponding regularized cost, seeking out a set of parameters that provide minimum validation error for our (high capacity) model.  This is illustrated in the bottom panels of [Figure 11.41](#figure-11-41), where we see our capacity dial set all the way to the right and our generic validation error curve wrapped around our optimization dial (as it now, roughly speaking, controls the complexity of each tuned model).

With a set of $M$ values $\left\{\lambda_m\right\}_{m=1}^M$ for our regularization parameter $\lambda$, sorted from *largest to smallest* ($\lambda_1$ being the largest value chosen and $\lambda_M$ the smallest), this scheme produces a sequence of $M$ parameter settings $\left\{\Theta_m\right\}_{m=1}^M$ and corresponding models $\left\{\text{model}\left(\mathbf{x},\Theta_m\right)\right\}_{m=1}^M$ of generally *increasing* complexity. Thus, formally speaking, we can see that regularizer based regularization falls into the general category of cross-validation techniques outlined in Section 11.4.
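The sweep over $\left\{\lambda_m\right\}_{m=1}^M$ can be sketched as below. This is a minimal illustration under several assumptions: synthetic data, degree-10 polynomial features, a squared $\ell_2$ regularizer that leaves out the bias weight, and a closed form solve standing in for "complete minimization" of each regularized Least Squares cost. The helper names `ridge_weights` and `mse` are hypothetical and are not part of the notebook's library.

```python
import numpy as np

def ridge_weights(F, y, lam):
    # complete minimization of (1/P)||F w - y||^2 + lam * sum_{n>=1} w_n^2;
    # the bias weight w_0 (first column of F) is left out of the regularizer
    D = np.eye(F.shape[1])
    D[0, 0] = 0.0
    A = F.T @ F / y.size + lam * D
    return np.linalg.solve(A, F.T @ y / y.size)

def mse(w, F, y):
    # mean squared error of the linear-in-weights model F @ w
    return np.mean((F @ w - y) ** 2)

# hypothetical synthetic data with degree-10 polynomial features
rng = np.random.default_rng(2)
x = rng.uniform(-1, 1, 30)
y = np.sin(3 * x) + 0.3 * rng.standard_normal(30)
F = np.vander(x, 11, increasing=True)
F_train, y_train, F_valid, y_valid = F[:20], y[:20], F[20:], y[20:]

# regularization parameters sorted from largest to smallest
lambdas = np.logspace(1, -4, 30)

# one complete minimization per value of lambda
weights = [ridge_weights(F_train, y_train, lam) for lam in lambdas]
train_errors = np.array([mse(w, F_train, y_train) for w in weights])
valid_errors = np.array([mse(w, F_valid, y_valid) for w in weights])

# keep the value of lambda whose weights give the lowest validation error
best = int(np.argmin(valid_errors))
best_lam, best_w = lambdas[best], weights[best]
```

As $\lambda$ shrinks, the training error of each completely minimized model can only decrease, so the selection has to be made on the validation error. The smallest value in the sweep is kept strictly positive here only so that the linear system stays well conditioned.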

---

<p>
  <img src= '../../mlrefined_images/nonlinear_superlearn_images/Figure_11_42_bottom.png' width="90%"  alt=""/>
</p>
<b>(left panel)</b> Testing out a large range and number of values for the regularization parameter $\lambda$ results in a fine resolution search for validation error minimizing weights.<br><br>
<b>(right panel)</b> A smaller number or a poorly chosen range of values can result in a coarse search that can skip over ideal weights.

### Implementation notes 

- Bias weights are often not included in the regularizer.

- How many values we can try to tune the regularization parameter is often limited by computation and time restrictions, since for every value of $\lambda$ tried a complete minimization of a corresponding regularized cost function must be performed.

- While the squared $\ell_2$ norm is a very popular regularizer, one can use - in principle - any simple function as a regularizer. A few examples follow.

### Popular regularizers in machine learning

- **The squared $\ell_2$ norm:** defined as $\Vert \mathbf{w} \Vert_2^{2} = \sum_{n=1}^{N} w_{n}^2$, this regularizer incentivizes weights to be <u>*small*</u> (in Euclidean sense).  

- **The $\ell_1$ norm:** defined as $\Vert \mathbf{w} \Vert_1 = \sum_{n=1}^{N} \vert w_{n}\vert$, this regularizer tends to produce <u>*sparse*</u> weights.

- **The total variation norm:** defined as $\Vert \mathbf{w} \Vert_{\text{TV}} = \sum_{n=1}^{N-1} \vert w_{n+1} - w_n\vert$, this regularizer tends to produce <u>*smoothly varying*</u> weights. 

<p>
  <img src= '../../mlrefined_images/nonlinear_superlearn_images/Figure_11_42_top.png' width="90%"  alt=""/>
</p>

A visual depiction of where the $\ell_2$ **(left panel)**, $\ell_1$ **(middle panel)**, and total variation **(right panel)** functions pull global minima when used as a regularizer.
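The three regularizers listed above translate directly into code. The short sketch below follows the stated definitions using plain `numpy`; the function names and the two example weight vectors are hypothetical, and any bias weight is assumed to have been left out of `w` beforehand.

```python
import numpy as np

def l2_squared(w):
    # squared l2 norm: sum of squared weights
    return np.sum(w ** 2)

def l1(w):
    # l1 norm: sum of absolute values of the weights
    return np.sum(np.abs(w))

def total_variation(w):
    # total variation norm: sum of absolute differences of adjacent weights
    return np.sum(np.abs(np.diff(w)))

w_flat = np.array([1.0, 1.0, 1.0, 1.0])    # constant weights
w_spike = np.array([0.0, 0.0, 3.0, 0.0])   # a single nonzero weight

flat_values = (l2_squared(w_flat), l1(w_flat), total_variation(w_flat))
spike_values = (l2_squared(w_spike), l1(w_spike), total_variation(w_spike))
```

The constant vector has zero total variation even though its $\ell_1$ and squared $\ell_2$ norms are both 4, while the sparse vector has a larger squared $\ell_2$ norm (9) and total variation (6) than $\ell_1$ norm (3). This is one way to see why each regularizer favors a different kind of weight vector.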

Below we illustrate the use of regularizer based regularization on a simple example.

---

#### <span style="color:#a50e3e;">Example.</span>   Tuning a regularization parameter for regression

- In this example we use an $\ell_2$-regularized degree-10 polynomial model.
 
- The training set/error is shown in light blue while the validation set/error is shown in yellow.
 
- We try out $100$ values of $\lambda$ (i.e., the regularization parameter) between $0$ and $1$. 

In [36]:
# This code cell will not be shown in the HTML version of this notebook
import copy

# load in dataset
csvname = datapath + 'noisy_sin_sample.csv'
data = np.loadtxt(csvname,delimiter = ',')
x = data[:-1,:]
y = data[-1:,:] 

# start process
num_units = 100
degree = 10
train_portion = 0.66
lambdas = np.linspace(0,1,num_units)

runs1 = []
w = 0
for j in range(num_units):
    lam = lambdas[j]
    
    # initialize with input/output data
    mylib1 = nonlib.intro_general_library.superlearn_setup.Setup(x,y)
    
    # define feature transforms
    mylib1.choose_features(name = 'polys',degree = degree)

    # standard normalize input
    mylib1.choose_normalizer(name = 'standard')

    # split into training and validation sets
    if j == 0:
        # make training/validation split
        mylib1.make_train_valid_split(train_portion = train_portion)
        train_inds = mylib1.train_inds
        valid_inds = mylib1.valid_inds

    else: # use split from first run for all further runs
        mylib1.x_train = mylib1.x[:,train_inds]
        mylib1.y_train = mylib1.y[:,train_inds]
        mylib1.x_valid = mylib1.x[:,valid_inds]
        mylib1.y_valid = mylib1.y[:,valid_inds]
        mylib1.train_inds = train_inds
        mylib1.valid_inds = valid_inds
        mylib1.train_portion = train_portion

    # choose cost
    mylib1.choose_cost(name = 'least_squares',lam=lam)

    if j == 0:
        # fit an optimization
        mylib1.fit(optimizer = 'newtons_method',max_its = 1,verbose = False)
    else:
        mylib1.fit(optimizer = 'newtons_method',max_its = 1,verbose = False,w=w,epsilon=10**(-12))

    # add model to list
    runs1.append(copy.deepcopy(mylib1))
    w = mylib1.w_init
    
# animate the business
frames = 100
csvname = datapath + 'noisy_sin_sample.csv'
demo1 = nonlib.regularization_regression_animators.Visualizer(csvname)
savepath = 'videos/animation_10.mp4'
demo1.animate_trainval_regularization(savepath,runs1,frames,num_units,show_history = True,fps = 10)

# load in video and display
from IPython.display import HTML
HTML("""
<video width="1000" height="400" controls loop>
  <source src="videos/animation_10.mp4" type="video/mp4">
  </video>
""")

---

## 11.6.4 Similarity to regularization for feature selection

As with the boosting procedure detailed in the previous Section, the careful reader will notice how similar the regularizer based regularization framework described here is to the concept of regularization detailed for feature selection in [Section 9.7](https://jermwatt.github.io/machine_learning_refined/notes/9_Feature_engineer_select/9_7_Regularization.html).  The two approaches are very similar in theme, except that here we do not select from a set of given input features but *create them ourselves based on a universal approximator*. Additionally, instead of our main concern with regularization being the *human interpretability* of a machine learning model, here we use regularization as a tool for cross-validation.