## Chapter 11: Principles of Feature Learning

# 11.4 Naive Cross-Validation

- By carefully searching through a set of models ranging in complexity we can identify the best of the bunch, the one that provides minimal error on the validation set.

- This comparison of models is called *cross-validation* (or sometimes *model search* or *selection*).

- Cross-validation is the basis of feature learning as it provides a systematic way to *learn* (as opposed to *engineer*, as detailed in Chapter 10) the proper form a nonlinear model should take for a given dataset.

In this Section we introduce what we refer to as *naive* cross-validation.  This consists of a search over a set of models of *varying capacity*, with each model fully optimized over the training set, in search of a validation error-minimizing choice.  While it is simple in principle and in implementation, naive cross-validation is generally very expensive computationally, and often results in a rather *coarse* model search that can miss (or 'skip over') the ideal amount of complexity for a given dataset.

In [1]:
## This code cell will not be shown in the HTML version of this notebook
# imports from custom library
import sys
sys.path.append('../../')
import autograd.numpy as np
from mlrefined_libraries import nonlinear_superlearn_library as nonlib
datapath = '../../mlrefined_datasets/nonlinear_superlearn_datasets/'

# plotting tools
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D

# this is needed to compensate for %matplotlib notebook's tendency to blow up images when plotted inline
%matplotlib notebook
from matplotlib import rcParams
rcParams['figure.autolayout'] = True

%load_ext autoreload
%autoreload 2

## The big picture

Suppose we want to select one of the $M$ models below that has the ideal amount of complexity for a given dataset:

\begin{equation}
\begin{array}{l}
\text{model}_1\left(\mathbf{x},\Theta_1\right) = w_0 + f_1\left(\mathbf{x}\right)w_{1} \\
\text{model}_2\left(\mathbf{x},\Theta_2\right) = w_0 + f_1\left(\mathbf{x}\right)w_{1} + f_2\left(\mathbf{x}\right)w_{2} \\
\qquad\qquad\qquad\,\vdots \\
\text{model}_M\left(\mathbf{x},\Theta_M\right) = w_0 + f_1\left(\mathbf{x}\right)w_{1} + f_2\left(\mathbf{x}\right)w_{2} + \cdots + f_M\left(\mathbf{x}\right)w_M.
\end{array}
\label{equation:naive-cross-validation-model-set}
\end{equation}
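
As a minimal sketch of this nested model set, the snippet below evaluates $\text{model}_m$ for scalar inputs, assuming the units are the monomials $f_m\left(x\right) = x^m$ (the choice used in the polynomial examples later in this Section). The function names `units` and `model` are hypothetical and are not part of `mlrefined_libraries`.

```python
import numpy as np

def units(x, m):
    """Stack f_1(x), ..., f_m(x) column-wise, assuming f_j(x) = x**j."""
    return np.stack([x**j for j in range(1, m + 1)], axis=1)  # shape (P, m)

def model(x, w):
    """Evaluate model_m(x, Theta_m) = w_0 + f_1(x) w_1 + ... + f_m(x) w_m.

    The number of units m is inferred from the weights: m = len(w) - 1.
    """
    m = len(w) - 1
    return w[0] + units(x, m) @ w[1:]

# model_2 at x = 2 with weights (w_0, w_1, w_2) = (1, 2, 3)
value = model(np.array([2.0]), np.array([1.0, 2.0, 3.0]))
```

Each model reuses every unit of the one before it and adds exactly one more, which is what orders the set from low to high capacity.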

Naive cross-validation entails taking the following steps:

- split the original data randomly into training and validation portions
- optimize every model to completion over the training portion
- measure the error of all $M$ fully trained models on each portion of the data
- pick the model that achieves *minimum validation error*
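
The steps above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: scalar inputs, polynomial units, a Least Squares cost (so that "optimizing to completion" is a single linear solve), and mean squared error as the measurement. The function `naive_cross_validation` and the synthetic noisy sinusoid are hypothetical stand-ins, not the library routines used elsewhere in this Section.

```python
import numpy as np

def naive_cross_validation(x, y, M, train_portion=2/3, seed=0):
    """Fit polynomial models of degree 1..M and pick the validation-error minimizer."""
    rng = np.random.default_rng(seed)

    # step 1: split the data randomly into training and validation portions
    perm = rng.permutation(x.size)
    n_train = int(round(train_portion * x.size))
    tr, va = perm[:n_train], perm[n_train:]

    train_errors, valid_errors = [], []
    for m in range(1, M + 1):
        # feature matrices with columns [1, x, x^2, ..., x^m]
        F_tr = np.vander(x[tr], m + 1, increasing=True)
        F_va = np.vander(x[va], m + 1, increasing=True)

        # step 2: optimize model m to completion (exact Least Squares solution)
        w, *_ = np.linalg.lstsq(F_tr, y[tr], rcond=None)

        # step 3: measure the error on each portion of the data
        train_errors.append(np.mean((F_tr @ w - y[tr]) ** 2))
        valid_errors.append(np.mean((F_va @ w - y[va]) ** 2))

    # step 4: pick the model with minimum validation error
    best_m = int(np.argmin(valid_errors)) + 1
    return best_m, np.array(train_errors), np.array(valid_errors)

# synthetic noisy sinusoid (an assumed stand-in for a real dataset)
rng = np.random.default_rng(1)
x = rng.uniform(0, 1, 30)
y = np.sin(2 * np.pi * x) + 0.2 * rng.standard_normal(30)
best_m, train_errors, valid_errors = naive_cross_validation(x, y, M=8)
```

Note that all $M$ models are fit on the same training portion and compared on the same validation portion, so the only quantity that changes across the search is the capacity $m$.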


---

<a id='figure-11-28'></a>
<p>
  <img align="right" src= '../../mlrefined_images/nonlinear_superlearn_images/Figure_11_28.png' width="60%"  alt=""/>
</p>

<b>Figure 11.28</b> <b>(top panel)</b> Prototypical training and validation error plots resulting from a run of naive cross-validation. <br><br><b>(bottom panels)</b> We turn the capacity dial from left to right, searching over a range of models of increasing capacity for a validation error-minimizing model, while keeping the optimization dial set all the way to the right.

---

In the top panel of [Figure 11.28](#figure-11-28) we show the generic sort of training (in blue) and validation (in yellow) errors we find in practice as a result of following this naive cross-validation scheme. The horizontal axis of this plot shows (roughly speaking) the complexity of each of our $M$ *fully optimized* models, and the vertical axis shows the error level.  As can be seen in the Figure, our low complexity models *underfit* the data, as reflected in their high training and validation errors. As the model complexity increases, fully optimized models achieve lower training error, since greater complexity allows us to represent the training data ever more closely. This fact is reflected in the monotonically decreasing nature of the (blue) training error curve. On the other hand, while the validation error of our models tends to decrease at first as we increase complexity, this trend continues only up to the point where *overfitting* of the training data begins. Once we reach a model complexity that overfits the training data, our validation error starts to increase again, as our model becomes less and less a fair representation of "data we might receive in the future" generated by the same phenomenon.

Note that in practice, while training error typically follows the monotonically decreasing trend shown in the top panel of Figure 11.28, validation error can oscillate up and down more than once depending on the models tested.  In any event, we determine the best fully optimized model from the set by choosing the one that *minimizes* validation error.  This is often referred to as solving the *bias-variance trade-off*, as it involves determining a model that (ideally) neither underfits (has high bias) nor overfits (has high variance).
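
Written out, this selection rule can be sketched as the following two-stage minimization. The index sets $\Omega_{\text{train}}$ and $\Omega_{\text{valid}}$, the starred symbols, and the use of squared error for both stages are notational assumptions introduced here for illustration; a classification problem would swap in a different cost and measurement (e.g., the Softmax cost and the number of misclassifications).

\begin{equation}
\begin{array}{l}
\Theta_m^{\star} = \underset{\Theta_m}{\text{argmin}} \,\, \frac{1}{\left|\Omega_{\text{train}}\right|}\sum_{p \in \Omega_{\text{train}}} \left(\text{model}_m\left(\mathbf{x}_p,\Theta_m\right) - y_p\right)^2, \qquad m = 1,\ldots,M \\
m^{\star} = \underset{1 \leq m \leq M}{\text{argmin}} \,\, \frac{1}{\left|\Omega_{\text{valid}}\right|}\sum_{p \in \Omega_{\text{valid}}} \left(\text{model}_m\left(\mathbf{x}_p,\Theta_m^{\star}\right) - y_p\right)^2
\end{array}
\end{equation}

The first line is solved $M$ times using the training data only; the second line is a simple comparison of $M$ numbers computed on the validation data.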

In the bottom row of [Figure 11.28](#figure-11-28) we summarize this naive approach to cross-validation using the capacity / optimization dial conceptualization first introduced in Section 11.3.2.  Here we set our *optimization* dial all the way to the right, indicating that we optimize every model to completion.  In ranging over our set of $M$ models we then turn the *capacity* dial from left to right, starting with $m=1$ (on the left) and ending with $m=M$ (all the way to the right), with the value of $m$ increasing by $1$ at each notch of the dial.  Since in this case the *capacity* dial roughly governs model complexity (as summarized visually in the bottom row of Figure 11.24), our model search reduces to setting this dial to the minimum validation error setting.  To visually denote how this is done we wrap the prototypical validation error curve shown in the top panel of [Figure 11.28](#figure-11-28) clockwise around the capacity dial.  We can then imagine setting this dial correctly (and automatically) to the value of $m$ providing minimum validation error.

#### <span style="color:#a50e3e;">Example.</span>  Naive cross-validation and regression

- In the animation that follows we use naive cross-validation on a toy regression dataset by employing a small set of polynomial models having degrees $1 \leq m \leq 8$.

- These models are naturally ordered from low to high capacity, as we increase the degree $m$ of the polynomial.

- Here we use $\frac{2}{3}$ of the data points for training (blue), and the other $\frac{1}{3}$ for validation (yellow).


---

In [1]:
## This code cell will not be shown in the HTML version of this notebook
# run demonstration
demo3 = nonlib.regression_basis_single.Visualizer()
csvname = datapath + 'noisy_sin_sample.csv'
demo3.load_data(csvname)
demo3.brows_single_cross_val(savepath='videos/animation_6.mp4',basis='poly',num_elements = [v for v in range(1,9)],folds = 3,fps=1)

# load in video and display
from IPython.display import HTML
HTML("""
<video width="1000" height="500" controls loop>
  <source src="videos/animation_6.mp4" type="video/mp4">
  </video>
""")

---

#### <span style="color:#a50e3e;">Example.</span>   Naive cross-validation and classification

- In the animation that follows we use naive cross-validation on a toy classification dataset by employing a small set of polynomial models having degrees $1 \leq m \leq 7$.

- These models are naturally ordered from low to high capacity, as we increase the degree $m$ of the polynomial.

- Here we use (approximately) $\frac{4}{5}$ of the data points for training (blue), and the other $\frac{1}{5}$ for validation (yellow).
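
For readers without access to the custom library, the following self-contained sketch mimics this experiment. Everything in it is an assumption made for illustration: the synthetic circular dataset stands in for `new_circle_data.csv`, the features are taken to be all monomials up to the given degree, and the helper names `poly_features` and `fit_softmax_newton` are hypothetical. It uses the $\frac{4}{5}$ / $\frac{1}{5}$ split stated above and five Newton steps on the Softmax cost with labels $y_p \in \left\{-1,+1\right\}$.

```python
import numpy as np
from itertools import combinations_with_replacement

def poly_features(X, degree):
    """Bias column plus all monomials of total degree 1..degree in the columns of X."""
    P, N = X.shape
    cols = [np.ones(P)]
    for d in range(1, degree + 1):
        for idx in combinations_with_replacement(range(N), d):
            cols.append(np.prod(X[:, list(idx)], axis=1))
    return np.stack(cols, axis=1)

def fit_softmax_newton(F, y, max_its=5, epsilon=1e-8):
    """Newton's method on the Softmax cost (1/P) sum_p log(1 + exp(-y_p F_p w))."""
    P, n = F.shape
    w = np.zeros(n)
    for _ in range(max_its):
        t = np.clip(y * (F @ w), -50, 50)      # clipped for numerical safety
        s = 1.0 / (1.0 + np.exp(t))            # sigmoid(-t)
        grad = -(F.T @ (y * s)) / P
        hess = (F.T * (s * (1.0 - s))) @ F / P
        # epsilon * I keeps the linear system solvable when the Hessian is near singular
        w = w - np.linalg.solve(hess + epsilon * np.eye(n), grad)
    return w

# synthetic two-class data: label +1 inside a circle, -1 outside (assumed stand-in)
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 2))
y = np.where(np.sum(X**2, axis=1) < 0.5, 1.0, -1.0)

# random 4/5 training, 1/5 validation split
perm = rng.permutation(200)
tr, va = perm[:160], perm[160:]

valid_miss = []
for degree in range(1, 8):
    w = fit_softmax_newton(poly_features(X[tr], degree), y[tr])
    predictions = np.sign(poly_features(X[va], degree) @ w)
    valid_miss.append(int(np.sum(predictions != y[va])))

# degree with the fewest validation misclassifications
best_degree = int(np.argmin(valid_miss)) + 1
```

Here the validation measurement is the number of misclassifications rather than the cost value itself, which is a common choice for classification.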

---

In [2]:
## This code cell will not be shown in the HTML version of this notebook
# load in dataset
csvname = datapath + 'new_circle_data.csv'
data = np.loadtxt(csvname,delimiter = ',')
x = data[:-1,:]
y = data[-1:,:] 

### run cross validation experiments ###
degrees = np.arange(1,8)
models_1 = []
for j in degrees:
    # set up a fresh instance of the library for this degree
    mylib1 = nonlib.intro_general_library.superlearn_setup.Setup(x,y)

    # choose features
    mylib1.choose_features(name = 'polys',degree = j)

    # choose normalizer
    mylib1.choose_normalizer(name = 'none')

    # split into training and validation sets
    mylib1.make_train_valid_split(train_portion = 0.66)

    # choose cost
    mylib1.choose_cost(name = 'softmax')

    # optimize the cost over the training set
    mylib1.fit(optimizer = 'newtons_method',max_its = 5,epsilon = 10**(-8))

    # add model to list
    models_1.append(mylib1)

# load up animator
csvname = datapath + 'new_circle_data.csv'
demo2 = nonlib.crossval_classification_animator.Visualizer(csvname)

# animate based on the sample weight history
savepath = 'videos/animation_7.mp4'
demo2.animate_crossval_classifications(savepath,models_1,fps=1)

# load in video and display
from IPython.display import HTML
HTML("""
<video width="1000" height="500" controls loop>
  <source src="videos/animation_7.mp4" type="video/mp4">
  </video>
""")

---

## Problems with naive cross-validation

- Since the process involves trying out a range of models, each of which is *optimized completely*, naive cross-validation can be very expensive computationally: its cost is roughly that of $M$ full optimization runs.

- The *capacity* difference between even adjacent models (e.g., those consisting of $m$ and $m+1$ units) can be quite large. Since each model is fully optimized, this can lead to huge jumps in the range of model complexities tried out on a dataset, producing a *coarse* resolution model search that can 'miss out' on the ideal amount of nonlinearity for a given dataset.
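
To get a feel for how large these jumps can be, the sketch below counts the tunable feature weights of a full polynomial of degree $D$ in $N$ input variables, assuming the model includes every monomial of total degree at most $D$ (excluding the constant). The helper `num_poly_features` is hypothetical; the count itself is the standard binomial formula $\binom{N+D}{D} - 1$.

```python
from math import comb

def num_poly_features(N, D):
    """Number of non-constant monomials of total degree <= D in N input variables."""
    return comb(N + D, D) - 1

# two-dimensional input: the count grows steadily with the degree
counts_2d = [num_poly_features(2, D) for D in range(1, 8)]

# 100-dimensional input: a single notch of the dial adds thousands of weights
counts_100d = [num_poly_features(100, D) for D in range(1, 4)]

print(counts_2d)
print(counts_100d)
```

Under this assumption, moving from degree $1$ to degree $2$ with $N = 100$ inputs takes the model from $100$ to $5150$ feature weights in one step, with no intermediate setting available to the search.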

---

<a id='figure-11-31'></a>
<figure>
<p>
  <img src= '../../mlrefined_images/nonlinear_superlearn_images/Figure_11_31.png' width="95%"  alt=""/>
</p>
<figcaption> <strong>Figure 11.31: </strong> <em> 
Naive cross-validation depicted as the proper adjustment of the capacity dial.  While ideally we would like the resolution of our search to be fine-grained (as depicted in the left panel), by its very nature the approach often results in a rather coarse search for validation error-minimizing models (as depicted in the right panel, where large turns of the capacity dial easily skip over the validation error-minimizing choice).
</em>
</figcaption>
</figure>

---