# Generalization: Model Validation
### [Neil D. Lawrence](http://inverseprobability.com), University of Sheffield
### 2015-10-27

**Abstract**: Generalization is the main objective of a machine learning algorithm.
The models we design should work on data they have not seen before.
Confirming whether a model generalizes well or not is the domain of
*model validation*. In this lecture we introduce approaches to model
validation such as hold out validation and cross validation.

$$
\newcommand{\tk}[1]{}
%\newcommand{\tk}[1]{\textbf{TK}: #1}
\newcommand{\Amatrix}{\mathbf{A}}
\newcommand{\KL}[2]{\text{KL}\left( #1\,\|\,#2 \right)}
\newcommand{\Kaast}{\kernelMatrix_{\mathbf{ \ast}\mathbf{ \ast}}}
\newcommand{\Kastu}{\kernelMatrix_{\mathbf{ \ast} \inducingVector}}
\newcommand{\Kff}{\kernelMatrix_{\mappingFunctionVector \mappingFunctionVector}}
\newcommand{\Kfu}{\kernelMatrix_{\mappingFunctionVector \inducingVector}}
\newcommand{\Kuast}{\kernelMatrix_{\inducingVector \bf\ast}}
\newcommand{\Kuf}{\kernelMatrix_{\inducingVector \mappingFunctionVector}}
\newcommand{\Kuu}{\kernelMatrix_{\inducingVector \inducingVector}}
\newcommand{\Kuui}{\Kuu^{-1}}
\newcommand{\Qaast}{\mathbf{Q}_{\bf \ast \ast}}
\newcommand{\Qastf}{\mathbf{Q}_{\ast \mappingFunction}}
\newcommand{\Qfast}{\mathbf{Q}_{\mappingFunctionVector \bf \ast}}
\newcommand{\Qff}{\mathbf{Q}_{\mappingFunctionVector \mappingFunctionVector}}
\newcommand{\aMatrix}{\mathbf{A}}
\newcommand{\aScalar}{a}
\newcommand{\aVector}{\mathbf{a}}
\newcommand{\acceleration}{a}
\newcommand{\bMatrix}{\mathbf{B}}
\newcommand{\bScalar}{b}
\newcommand{\bVector}{\mathbf{b}}
\newcommand{\basisFunc}{\phi}
\newcommand{\basisFuncVector}{\boldsymbol{ \basisFunc}}
\newcommand{\basisFunction}{\phi}
\newcommand{\basisLocation}{\mu}
\newcommand{\basisMatrix}{\boldsymbol{ \Phi}}
\newcommand{\basisScalar}{\basisFunction}
\newcommand{\basisVector}{\boldsymbol{ \basisFunction}}
\newcommand{\activationFunction}{\phi}
\newcommand{\activationMatrix}{\boldsymbol{ \Phi}}
\newcommand{\activationScalar}{\basisFunction}
\newcommand{\activationVector}{\boldsymbol{ \basisFunction}}
\newcommand{\bigO}{\mathcal{O}}
\newcommand{\binomProb}{\pi}
\newcommand{\cMatrix}{\mathbf{C}}
\newcommand{\cbasisMatrix}{\hat{\boldsymbol{ \Phi}}}
\newcommand{\cdataMatrix}{\hat{\dataMatrix}}
\newcommand{\cdataScalar}{\hat{\dataScalar}}
\newcommand{\cdataVector}{\hat{\dataVector}}
\newcommand{\centeredKernelMatrix}{\mathbf{ \MakeUppercase{\centeredKernelScalar}}}
\newcommand{\centeredKernelScalar}{b}
\newcommand{\centeredKernelVector}{\centeredKernelScalar}
\newcommand{\centeringMatrix}{\mathbf{H}}
\newcommand{\chiSquaredDist}[2]{\chi_{#1}^{2}\left(#2\right)}
\newcommand{\chiSquaredSamp}[1]{\chi_{#1}^{2}}
\newcommand{\conditionalCovariance}{\boldsymbol{ \Sigma}}
\newcommand{\coregionalizationMatrix}{\mathbf{B}}
\newcommand{\coregionalizationScalar}{b}
\newcommand{\coregionalizationVector}{\mathbf{ \coregionalizationScalar}}
\newcommand{\covDist}[2]{\text{cov}_{#2}\left(#1\right)}
\newcommand{\covSamp}[1]{\text{cov}\left(#1\right)}
\newcommand{\covarianceScalar}{c}
\newcommand{\covarianceVector}{\mathbf{ \covarianceScalar}}
\newcommand{\covarianceMatrix}{\mathbf{C}}
\newcommand{\covarianceMatrixTwo}{\boldsymbol{ \Sigma}}
\newcommand{\croupierScalar}{s}
\newcommand{\croupierVector}{\mathbf{ \croupierScalar}}
\newcommand{\croupierMatrix}{\mathbf{ \MakeUppercase{\croupierScalar}}}
\newcommand{\dataDim}{p}
\newcommand{\dataIndex}{i}
\newcommand{\dataIndexTwo}{j}
\newcommand{\dataMatrix}{\mathbf{Y}}
\newcommand{\dataScalar}{y}
\newcommand{\dataSet}{\mathcal{D}}
\newcommand{\dataStd}{\sigma}
\newcommand{\dataVector}{\mathbf{ \dataScalar}}
\newcommand{\decayRate}{d}
\newcommand{\degreeMatrix}{\mathbf{ \MakeUppercase{\degreeScalar}}}
\newcommand{\degreeScalar}{d}
\newcommand{\degreeVector}{\mathbf{ \degreeScalar}}
% Already defined by latex
%\newcommand{\det}[1]{\left|#1\right|}
\newcommand{\diag}[1]{\text{diag}\left(#1\right)}
\newcommand{\diagonalMatrix}{\mathbf{D}}
\newcommand{\diff}[2]{\frac{\text{d}#1}{\text{d}#2}}
\newcommand{\diffTwo}[2]{\frac{\text{d}^2#1}{\text{d}#2^2}}
\newcommand{\displacement}{x}
\newcommand{\displacementVector}{\textbf{\displacement}}
\newcommand{\distanceMatrix}{\mathbf{ \MakeUppercase{\distanceScalar}}}
\newcommand{\distanceScalar}{d}
\newcommand{\distanceVector}{\mathbf{ \distanceScalar}}
\newcommand{\eigenvaltwo}{\ell}
\newcommand{\eigenvaltwoMatrix}{\mathbf{L}}
\newcommand{\eigenvaltwoVector}{\mathbf{l}}
\newcommand{\eigenvalue}{\lambda}
\newcommand{\eigenvalueMatrix}{\boldsymbol{ \Lambda}}
\newcommand{\eigenvalueVector}{\boldsymbol{ \lambda}}
\newcommand{\eigenvector}{\mathbf{ \eigenvectorScalar}}
\newcommand{\eigenvectorMatrix}{\mathbf{U}}
\newcommand{\eigenvectorScalar}{u}
\newcommand{\eigenvectwo}{\mathbf{v}}
\newcommand{\eigenvectwoMatrix}{\mathbf{V}}
\newcommand{\eigenvectwoScalar}{v}
\newcommand{\entropy}[1]{\mathcal{H}\left(#1\right)}
\newcommand{\errorFunction}{E}
\newcommand{\expDist}[2]{\left<#1\right>_{#2}}
\newcommand{\expSamp}[1]{\left<#1\right>}
\newcommand{\expectation}[1]{\left\langle #1 \right\rangle }
\newcommand{\expectationDist}[2]{\left\langle #1 \right\rangle _{#2}}
\newcommand{\expectedDistanceMatrix}{\mathcal{D}}
\newcommand{\eye}{\mathbf{I}}
\newcommand{\fantasyDim}{r}
\newcommand{\fantasyMatrix}{\mathbf{ \MakeUppercase{\fantasyScalar}}}
\newcommand{\fantasyScalar}{z}
\newcommand{\fantasyVector}{\mathbf{ \fantasyScalar}}
\newcommand{\featureStd}{\varsigma}
\newcommand{\gammaCdf}[3]{\mathcal{GAMMA CDF}\left(#1|#2,#3\right)}
\newcommand{\gammaDist}[3]{\mathcal{G}\left(#1|#2,#3\right)}
\newcommand{\gammaSamp}[2]{\mathcal{G}\left(#1,#2\right)}
\newcommand{\gaussianDist}[3]{\mathcal{N}\left(#1|#2,#3\right)}
\newcommand{\gaussianSamp}[2]{\mathcal{N}\left(#1,#2\right)}
\newcommand{\given}{|}
\newcommand{\half}{\frac{1}{2}}
\newcommand{\heaviside}{H}
\newcommand{\hiddenMatrix}{\mathbf{ \MakeUppercase{\hiddenScalar}}}
\newcommand{\hiddenScalar}{h}
\newcommand{\hiddenVector}{\mathbf{ \hiddenScalar}}
\newcommand{\identityMatrix}{\eye}
\newcommand{\inducingInputScalar}{z}
\newcommand{\inducingInputVector}{\mathbf{ \inducingInputScalar}}
\newcommand{\inducingInputMatrix}{\mathbf{Z}}
\newcommand{\inducingScalar}{u}
\newcommand{\inducingVector}{\mathbf{ \inducingScalar}}
\newcommand{\inducingMatrix}{\mathbf{U}}
\newcommand{\inlineDiff}[2]{\text{d}#1/\text{d}#2}
\newcommand{\inputDim}{q}
\newcommand{\inputMatrix}{\mathbf{X}}
\newcommand{\inputScalar}{x}
\newcommand{\inputSpace}{\mathcal{X}}
\newcommand{\inputVals}{\inputVector}
\newcommand{\inputVector}{\mathbf{ \inputScalar}}
\newcommand{\iterNum}{k}
\newcommand{\kernel}{\kernelScalar}
\newcommand{\kernelMatrix}{\mathbf{K}}
\newcommand{\kernelScalar}{k}
\newcommand{\kernelVector}{\mathbf{ \kernelScalar}}
\newcommand{\kff}{\kernelScalar_{\mappingFunction \mappingFunction}}
\newcommand{\kfu}{\kernelVector_{\mappingFunction \inducingScalar}}
\newcommand{\kuf}{\kernelVector_{\inducingScalar \mappingFunction}}
\newcommand{\kuu}{\kernelVector_{\inducingScalar \inducingScalar}}
\newcommand{\lagrangeMultiplier}{\lambda}
\newcommand{\lagrangeMultiplierMatrix}{\boldsymbol{ \Lambda}}
\newcommand{\lagrangian}{L}
\newcommand{\laplacianFactor}{\mathbf{ \MakeUppercase{\laplacianFactorScalar}}}
\newcommand{\laplacianFactorScalar}{m}
\newcommand{\laplacianFactorVector}{\mathbf{ \laplacianFactorScalar}}
\newcommand{\laplacianMatrix}{\mathbf{L}}
\newcommand{\laplacianScalar}{\ell}
\newcommand{\laplacianVector}{\mathbf{ \ell}}
\newcommand{\latentDim}{q}
\newcommand{\latentDistanceMatrix}{\boldsymbol{ \Delta}}
\newcommand{\latentDistanceScalar}{\delta}
\newcommand{\latentDistanceVector}{\boldsymbol{ \delta}}
\newcommand{\latentForce}{f}
\newcommand{\latentFunction}{u}
\newcommand{\latentFunctionVector}{\mathbf{ \latentFunction}}
\newcommand{\latentFunctionMatrix}{\mathbf{ \MakeUppercase{\latentFunction}}}
\newcommand{\latentIndex}{j}
\newcommand{\latentScalar}{z}
\newcommand{\latentVector}{\mathbf{ \latentScalar}}
\newcommand{\latentMatrix}{\mathbf{Z}}
\newcommand{\learnRate}{\eta}
\newcommand{\lengthScale}{\ell}
\newcommand{\rbfWidth}{\ell}
\newcommand{\likelihoodBound}{\mathcal{L}}
\newcommand{\likelihoodFunction}{L}
\newcommand{\locationScalar}{\mu}
\newcommand{\locationVector}{\boldsymbol{ \locationScalar}}
\newcommand{\locationMatrix}{\mathbf{M}}
\newcommand{\variance}[1]{\text{var}\left( #1 \right)}
\newcommand{\mappingFunction}{f}
\newcommand{\mappingFunctionMatrix}{\mathbf{F}}
\newcommand{\mappingFunctionTwo}{g}
\newcommand{\mappingFunctionTwoMatrix}{\mathbf{G}}
\newcommand{\mappingFunctionTwoVector}{\mathbf{ \mappingFunctionTwo}}
\newcommand{\mappingFunctionVector}{\mathbf{ \mappingFunction}}
\newcommand{\scaleScalar}{s}
\newcommand{\mappingScalar}{w}
\newcommand{\mappingVector}{\mathbf{ \mappingScalar}}
\newcommand{\mappingMatrix}{\mathbf{W}}
\newcommand{\mappingScalarTwo}{v}
\newcommand{\mappingVectorTwo}{\mathbf{ \mappingScalarTwo}}
\newcommand{\mappingMatrixTwo}{\mathbf{V}}
\newcommand{\maxIters}{K}
\newcommand{\meanMatrix}{\mathbf{M}}
\newcommand{\meanScalar}{\mu}
\newcommand{\meanTwoMatrix}{\mathbf{M}}
\newcommand{\meanTwoScalar}{m}
\newcommand{\meanTwoVector}{\mathbf{ \meanTwoScalar}}
\newcommand{\meanVector}{\boldsymbol{ \meanScalar}}
\newcommand{\mrnaConcentration}{m}
\newcommand{\naturalFrequency}{\omega}
\newcommand{\neighborhood}[1]{\mathcal{N}\left( #1 \right)}
\newcommand{\neilurl}{http://inverseprobability.com/}
\newcommand{\noiseMatrix}{\boldsymbol{ E}}
\newcommand{\noiseScalar}{\epsilon}
\newcommand{\noiseVector}{\boldsymbol{ \epsilon}}
\newcommand{\norm}[1]{\left\Vert #1 \right\Vert}
\newcommand{\normalizedLaplacianMatrix}{\hat{\mathbf{L}}}
\newcommand{\normalizedLaplacianScalar}{\hat{\ell}}
\newcommand{\normalizedLaplacianVector}{\hat{\mathbf{ \ell}}}
\newcommand{\numActive}{m}
\newcommand{\numBasisFunc}{m}
\newcommand{\numComponents}{m}
\newcommand{\numComps}{K}
\newcommand{\numData}{n}
\newcommand{\numFeatures}{K}
\newcommand{\numHidden}{h}
\newcommand{\numInducing}{m}
\newcommand{\numLayers}{\ell}
\newcommand{\numNeighbors}{K}
\newcommand{\numSequences}{s}
\newcommand{\numSuccess}{s}
\newcommand{\numTasks}{m}
\newcommand{\numTime}{T}
\newcommand{\numTrials}{S}
\newcommand{\outputIndex}{j}
\newcommand{\paramVector}{\boldsymbol{ \theta}}
\newcommand{\parameterMatrix}{\boldsymbol{ \Theta}}
\newcommand{\parameterScalar}{\theta}
\newcommand{\parameterVector}{\boldsymbol{ \parameterScalar}}
\newcommand{\partDiff}[2]{\frac{\partial#1}{\partial#2}}
\newcommand{\precisionScalar}{j}
\newcommand{\precisionVector}{\mathbf{ \precisionScalar}}
\newcommand{\precisionMatrix}{\mathbf{J}}
\newcommand{\pseudotargetScalar}{\widetilde{y}}
\newcommand{\pseudotargetVector}{\mathbf{ \pseudotargetScalar}}
\newcommand{\pseudotargetMatrix}{\mathbf{ \widetilde{Y}}}
\newcommand{\rank}[1]{\text{rank}\left(#1\right)}
\newcommand{\rayleighDist}[2]{\mathcal{R}\left(#1|#2\right)}
\newcommand{\rayleighSamp}[1]{\mathcal{R}\left(#1\right)}
\newcommand{\responsibility}{r}
\newcommand{\rotationScalar}{r}
\newcommand{\rotationVector}{\mathbf{ \rotationScalar}}
\newcommand{\rotationMatrix}{\mathbf{R}}
\newcommand{\sampleCovScalar}{s}
\newcommand{\sampleCovVector}{\mathbf{ \sampleCovScalar}}
\newcommand{\sampleCovMatrix}{\mathbf{s}}
\newcommand{\scalarProduct}[2]{\left\langle{#1},{#2}\right\rangle}
\newcommand{\sign}[1]{\text{sign}\left(#1\right)}
\newcommand{\sigmoid}[1]{\sigma\left(#1\right)}
\newcommand{\singularvalue}{\ell}
\newcommand{\singularvalueMatrix}{\mathbf{L}}
\newcommand{\singularvalueVector}{\mathbf{l}}
\newcommand{\sorth}{\mathbf{u}}
\newcommand{\spar}{\lambda}
\newcommand{\trace}[1]{\text{tr}\left(#1\right)}
\newcommand{\BasalRate}{B}
\newcommand{\DampingCoefficient}{C}
\newcommand{\DecayRate}{D}
\newcommand{\Displacement}{X}
\newcommand{\LatentForce}{F}
\newcommand{\Mass}{M}
\newcommand{\Sensitivity}{S}
\newcommand{\basalRate}{b}
\newcommand{\dampingCoefficient}{c}
\newcommand{\mass}{m}
\newcommand{\sensitivity}{s}
\newcommand{\springScalar}{\kappa}
\newcommand{\springVector}{\boldsymbol{ \kappa}}
\newcommand{\springMatrix}{\boldsymbol{ \mathcal{K}}}
\newcommand{\tfConcentration}{p}
\newcommand{\tfDecayRate}{\delta}
\newcommand{\tfMrnaConcentration}{f}
\newcommand{\tfVector}{\mathbf{ \tfConcentration}}
\newcommand{\velocity}{v}
\newcommand{\sufficientStatsScalar}{g}
\newcommand{\sufficientStatsVector}{\mathbf{ \sufficientStatsScalar}}
\newcommand{\sufficientStatsMatrix}{\mathbf{G}}
\newcommand{\switchScalar}{s}
\newcommand{\switchVector}{\mathbf{ \switchScalar}}
\newcommand{\switchMatrix}{\mathbf{S}}
\newcommand{\tr}[1]{\text{tr}\left(#1\right)}
\newcommand{\loneNorm}[1]{\left\Vert #1 \right\Vert_1}
\newcommand{\ltwoNorm}[1]{\left\Vert #1 \right\Vert_2}
\newcommand{\onenorm}[1]{\left\vert#1\right\vert_1}
\newcommand{\twonorm}[1]{\left\Vert #1 \right\Vert}
\newcommand{\vScalar}{v}
\newcommand{\vVector}{\mathbf{v}}
\newcommand{\vMatrix}{\mathbf{V}}
\newcommand{\varianceDist}[2]{\text{var}_{#2}\left( #1 \right)}
% Already defined by latex
%\newcommand{\vec}{#1:}
\newcommand{\vecb}[1]{\left(#1\right):}
\newcommand{\weightScalar}{w}
\newcommand{\weightVector}{\mathbf{ \weightScalar}}
\newcommand{\weightMatrix}{\mathbf{W}}
\newcommand{\weightedAdjacencyMatrix}{\mathbf{A}}
\newcommand{\weightedAdjacencyScalar}{a}
\newcommand{\weightedAdjacencyVector}{\mathbf{ \weightedAdjacencyScalar}}
\newcommand{\onesVector}{\mathbf{1}}
\newcommand{\zerosVector}{\mathbf{0}}
$$

## Review

-   Last time: introduced basis functions.
-   Showed how to maximize the likelihood of a non-linear model that's
    linear in parameters.
-   Explored the different characteristics of different basis function
    models

## Alan Turing

<table>
<tr>
<td width="50%">
<img class="" src="http://inverseprobability.com/talks/slides/../slides/diagrams/turing-times.gif" style="width:100%">
</td>
<td width="50%">
<img class="" src="http://inverseprobability.com/talks/slides/../slides/diagrams/turing-run.jpg" style="width:50%">
</td>
</tr>
</table>
Figure: <i>Alan Turing. In 1946 he was only 11 minutes slower than the
winner of the 1948 games. Would he have won a hypothetical games held in
1946? Source: [Alan Turing Internet
Scrapbook](http://www.turing.org.uk/scrapbook/run.html).</i>

If we had to summarise the objectives of machine learning in one word, a
very good candidate for that word would be *generalization*. What is
generalization? From a human perspective it might be summarised as the
ability to take lessons learned in one domain and apply them to another
domain. If we accept the definition given in the first session for
machine learning, $$
\text{data} + \text{model} \xrightarrow{\text{compute}} \text{prediction}
$$ then we see that without a model we can't generalise: we only have
data. Data is fine for answering very specific questions, like "Who won
the Olympic Marathon in 2012?", because we have that answer stored.
However, we are not given the answer to many other questions. For
example, Alan Turing was a formidable marathon runner: in 1946 he ran a
time of 2 hours 46 minutes (just under four minutes per kilometer, faster
than I and most of the other [Endcliffe Park
Run](http://www.parkrun.org.uk/sheffieldhallam/) runners can do 5 km).
What is the probability he would have won an Olympics if one had been
held in 1946?

To answer this question we need to generalize. Before we look at how to
check whether a model generalizes, let's introduce a formal
representation of what it means to generalize in machine learning.

## Expected Loss

Our objective function so far has been the negative log likelihood,
which we have minimized (via the sum of squares error) to obtain our
model. However, there is an alternative perspective on an objective
function, that of a *loss function*. A loss function is a cost function
associated with the penalty you might need to pay for a particular
incorrect decision. One approach to machine learning involves specifying
a loss function and considering how much a particular model is likely to
cost us across its lifetime. We can represent this with an expectation.
If our loss function is given as
$L(\dataScalar, \inputScalar, \mappingVector)$ for a particular model
that predicts $\dataScalar$ given $\inputScalar$ and $\mappingVector$
then we are interested in minimizing the expected loss under the likely
distribution of $\dataScalar$ and $\inputScalar$. To understand this
formally we define the *true* distribution of the data samples,
$\dataScalar$, $\inputScalar$. This is a particular distribution that
we don't have access to very often, and to represent that we define it
with a variant of the letter 'P',
$\mathbb{P}(\dataScalar, \inputScalar)$. If we genuinely pay
$L(\dataScalar, \inputScalar, \mappingVector)$ for every mistake we
make, and the future test data is genuinely drawn from
$\mathbb{P}(\dataScalar, \inputScalar)$ then we can define our expected
loss, or risk, to be, $$
R(\mappingVector) = \int L(\dataScalar, \inputScalar, \mappingVector) \mathbb{P}(\dataScalar, \inputScalar) \text{d}\dataScalar
\text{d}\inputScalar.
$$ Of course, in practice, this value can't be computed *but* it serves
as a reminder of what it is we are aiming to minimize and under certain
circumstances it can be approximated.

## Sample Based Approximations

A sample based approximation to an expectation involves replacing the
true expectation with a sum over samples from the distribution.

$$
\int \mappingFunction(z) p(z) \text{d}z\approx \frac{1}{s}\sum_{i=1}^s \mappingFunction(z_i),
$$ where $\{z_i\}_{i=1}^s$ are a set of $s$ independent and identically
distributed samples from the distribution $p(z)$. This approximation
becomes better for larger $s$, although the *rate of convergence* to the
true integral will be very dependent on the distribution $p(z)$ *and*
the function $\mappingFunction(z)$.
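
As a minimal sketch of this idea, the snippet below approximates $\int z^2 p(z) \text{d}z$ when $p(z)$ is a standard normal, for which the true value is 1. The choice of integrand, the fixed seed and the helper name `sample_based_expectation` are illustrative assumptions, not part of the lecture material.

```python
import random


def sample_based_expectation(f, sampler, s):
    """Approximate the expectation of f(z) by averaging over s samples."""
    return sum(f(sampler()) for _ in range(s)) / s


# Hypothetical example: z ~ N(0, 1) and f(z) = z^2, so the true value is 1.
rng = random.Random(0)  # fixed seed so the run is repeatable


def draw_standard_normal():
    return rng.gauss(0.0, 1.0)


for s in (10, 1000, 100000):
    estimate = sample_based_expectation(lambda z: z ** 2, draw_standard_normal, s)
    print(f"s = {s:6d}: estimate = {estimate:.4f}")
```

The estimates should settle towards 1 as $s$ grows, although how quickly they do so depends on both the integrand and the distribution.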

That said, this means we can approximate our true integral with the sum,
$$
R(\mappingVector) \approx \frac{1}{\numData}\sum_{i=1}^{\numData} L(\dataScalar_i, \inputScalar_i, \mappingVector),
$$ if $\dataScalar_i$ and $\inputScalar_i$ are independent samples from the
true distribution $\mathbb{P}(\dataScalar, \inputScalar)$. Minimizing
this sum directly is known as *empirical risk minimization*. The sum of
squares error we have been using can be recovered for this case by
considering a *squared loss*, $$
L(\dataScalar, \inputScalar, \mappingVector) = (\dataScalar-\mappingVector^\top\boldsymbol{\phi}(\inputScalar))^2,
$$ which gives an empirical risk of the form $$
R(\mappingVector) \approx \frac{1}{\numData} \sum_{i=1}^{\numData}
(\dataScalar_i - \mappingVector^\top \boldsymbol{\phi}(\inputScalar_i))^2
$$ which up to the constant $\frac{1}{\numData}$ is identical to the
objective function we have been using so far.
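
To make the relationship concrete, here is a small sketch that computes the empirical risk under the squared loss and compares it with the sum of squares error. The three data points, the parameter vector and the basis $[1, x]$ are made up so that the arithmetic is easy to check by hand.

```python
import numpy as np

# Made-up data and parameters, chosen only so the arithmetic is easy to check.
x = np.array([[0.0], [1.0], [2.0]])
y = np.array([[1.0], [3.0], [5.0]])
w = np.array([[0.0], [2.0]])  # hypothetical parameters for the basis [1, x]

# Design matrix: one row of basis functions per data point.
Phi = np.hstack([np.ones_like(x), x])

# Squared loss for each data point, then its average (the empirical risk).
losses = (y - Phi @ w) ** 2
empirical_risk = losses.mean()

# The sum of squares error differs only by the constant factor 1/n.
sum_of_squares = losses.sum()
n = y.shape[0]
print(empirical_risk, sum_of_squares / n)
```

Each prediction here misses its target by exactly 1, so the sum of squares error is 3 and the empirical risk is 1.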

## Estimating Risk through Validation

Unfortunately, minimising the empirical risk only guarantees something
about our performance on the training data. If we don't have enough data
for the approximation to the risk to be valid, then we can end up
performing significantly worse on test data. Fortunately, we can also
estimate the risk for test data by estimating the risk on unseen
data. The main trick here is to 'hold out' a portion of our data from
training and use the model's performance on that subset of the data as a
proxy for the true risk. This data is known as 'validation' data. It
contrasts with test data, because its values are known at model
design time. However, as with test data, we don't use it to fit
our model. This means that it doesn't exhibit the same bias that the
empirical risk does when estimating the true risk.
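
The following toy sketch illustrates why the training error alone can mislead. It fits a straight line and a cubic to four made-up training points and evaluates both on three held out points. The data, the helper names `poly_design` and `sum_squares`, and the use of `np.linalg.lstsq` for the fit are all assumptions chosen for illustration.

```python
import numpy as np


def poly_design(x, degree):
    """Design matrix with columns x^0, x^1, ..., x^degree (hypothetical helper)."""
    return x ** np.arange(degree + 1)


def sum_squares(w, x, y, degree):
    """Sum of squares error of the polynomial model with parameters w."""
    residual = y - poly_design(x, degree) @ w
    return float(np.sum(residual ** 2))


# Made-up data: targets scatter around 0.5; three points are held out.
x_train = np.array([[0.0], [1.0], [2.0], [3.0]])
y_train = np.array([[0.0], [1.0], [0.0], [1.0]])
x_valid = np.array([[0.5], [1.5], [2.5]])
y_valid = np.array([[0.5], [0.5], [0.5]])

errors = {}
for degree in (1, 3):
    Phi = poly_design(x_train, degree)
    # Least squares fit using the training data only.
    w = np.linalg.lstsq(Phi, y_train, rcond=None)[0]
    errors[degree] = (sum_squares(w, x_train, y_train, degree),
                      sum_squares(w, x_valid, y_valid, degree))
    print(f"degree {degree}: training error {errors[degree][0]:.3f}, "
          f"validation error {errors[degree][1]:.3f}")
```

On this toy data the cubic passes through every training point, so it has the smaller training error, yet it has the larger error on the held out points. That is the pattern validation is designed to reveal.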

## Validation

In this lab we will explore techniques for model selection that make use
of validation data: data that isn't seen by the model in the learning
(or fitting) phase, but is used to *validate* our choice of model from
amongst the different designs we have selected.

In machine learning, we are looking to minimise the value of our
objective function $E$ with respect to its parameters $\mappingVector$.
We do this by considering our training data. We minimize the value of
the objective function as it's observed at each training point. However,
we are really interested in how the model will perform on future data.
For evaluating that we choose to *hold out* a portion of the data for
evaluating the quality of the model.

We will review the different methods of model selection on the Olympics
marathon data. Firstly we import the Olympic marathon data.

## Olympic Marathon Data

<table>
<tr>
<td width="70%">
-   Gold medal times for Olympic Marathon since 1896.
-   Marathons before 1924 didn't have a standardised distance.
-   Present results using pace per km.
-   In 1904 the marathon was badly organised, leading to very slow times.

</td>
<td width="30%">
<img class="" src="http://inverseprobability.com/talks/slides/../slides/diagrams/Stephen_Kiprotich.jpg" style="width:100%">
<small>Image from Wikimedia Commons <http://bit.ly/16kMKHQ></small>
</td>
</tr>
</table>
The first thing we will do is load a standard data set for regression
modelling. The data consists of the pace of Olympic Gold Medal Marathon
winners for the Olympics from 1896 to present. First we load in the data
and plot.

In [None]:
import numpy as np
import pods

In [None]:
data = pods.datasets.olympic_marathon_men()
x = data['X']
y = data['Y']

offset = y.mean()
scale = np.sqrt(y.var())

In [None]:
import matplotlib.pyplot as plt
import teaching_plots as plot
import mlai

In [None]:

xlim = (1875,2030)
ylim = (2.5, 6.5)
yhat = (y-offset)/scale

fig, ax = plt.subplots(figsize=plot.big_wide_figsize)
_ = ax.plot(x, y, 'r.',markersize=10)
ax.set_xlabel('year', fontsize=20)
ax.set_ylabel('pace min/km', fontsize=20)
ax.set_xlim(xlim)
ax.set_ylim(ylim)

mlai.write_figure(figure=fig, 
                  filename='../slides/diagrams/datasets/olympic-marathon.svg', 
                  transparent=True, 
                  frameon=True)

<img src="http://inverseprobability.com/talks/slides/../slides/diagrams/datasets/olympic-marathon.svg" class="" align="" style="vertical-align:middle;">

Figure: <i>Olympic marathon pace times since 1896.</i>

Things to notice about the data include the outlier in 1904. In that
year the Olympics was held in St Louis, USA. Organizational problems and
challenges with dust kicked up by the cars following the race meant that
participants got lost, and only very few participants completed the race.

More recent years see more consistently quick marathons.

## Validation on the Olympic Marathon Data

The first thing we'll do is fit a standard linear model to the data. We
recall from previous lectures and lab classes that to do this we need to
solve the system $$
\basisMatrix^\top \basisMatrix \mappingVector = \basisMatrix^\top \dataVector
$$ for $\mappingVector$ and use the resulting vector to make predictions
at the training points and test points, $$
\mappingFunctionVector = \basisMatrix \mappingVector.
$$ The prediction function can be used to compute the objective
function, $$
E(\mappingVector) = \sum_{i=1}^{\numData} (\dataScalar_i - \mappingVector^\top\boldsymbol{\phi}(\inputScalar_i))^2
$$ by substituting in the prediction in vector form we have $$
E(\mappingVector) =  (\dataVector - \mappingFunctionVector)^\top(\dataVector - \mappingFunctionVector)
$$
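
The steps above, solving the system for $\mappingVector$, forming the predictions and evaluating the objective in vector form, can be sketched on made-up data as follows. The data lie exactly on the line $y = 2x + 1$, an assumption made only so that the recovered parameters are easy to check, and the variable names are illustrative.

```python
import numpy as np

# Made-up noise-free data on the line y = 2x + 1.
x = np.array([[0.0], [1.0], [2.0], [3.0]])
y = 2.0 * x + 1.0

# Linear basis: a column of inputs and a column of ones.
Phi = np.hstack([x, np.ones_like(x)])

# Solve Phi^T Phi w = Phi^T y for w rather than forming an explicit inverse.
w = np.linalg.solve(Phi.T @ Phi, Phi.T @ y)

# Predictions and the objective in vector form.
f = Phi @ w
E = ((y - f).T @ (y - f)).item()
print(w.ravel(), E)
```

Because the data are noise free, the recovered parameters are the slope 2 and the offset 1, and the objective is zero up to rounding error. On real data the objective will be positive.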

### Question 1

In this question you will construct some flexible general code for
fitting linear models.

Create a python function that computes $\basisMatrix$ for the linear
basis,
$$\basisMatrix = \begin{bmatrix} \inputVector & \mathbf{1}\end{bmatrix}$$
Name your function `linear`. `Phi` should be in the form of a *design
matrix* and `x` should be in the form of a `numpy` two dimensional array
with $\numData$ rows and 1 column. Calls to your function should be in
the following form:

`Phi = linear(x)`

Create a python function that accepts, as arguments, a python function
that defines a basis (like the one you've just created called `linear`)
as well as a set of inputs and a vector of parameters. Your new python
function should return a prediction. Name your function `prediction`.
The return value `f` should be a two dimensional `numpy` array with
$\numData$ rows and $1$ column, where $\numData$ is the number of data
points. Calls to your function should be in the following form:

`f = prediction(w, x, linear)`

Create a python function that computes the sum of squares objective
function (or error function). It should accept your input data (or
covariates) and target data (or response variables) and your parameter
vector `w` as arguments. It should also accept a python function that
represents the basis. Calls to your function should be in the following
form:

`e = objective(w, x, y, linear)`

Create a function that solves the linear system for the set of
parameters that minimizes the sum of squares objective. It should accept
input data, target data and a python function for the basis as the
inputs. Calls to your function should be in the following form:

`w = fit(x, y, linear)`

Fit a linear model to the olympic data using these functions and plot
the resulting prediction between 1890 and 2020. Set the title of the
plot to be the error of the fit on the *training data*.

*15 marks*

In [None]:
# Write code for your answer to Question 1 in this box
# provide the answers so that the code runs correctly otherwise you will lose marks!



## Polynomial Fit: Training Error

### Question 2

In this question we extend the code above to a non-linear basis (a
quadratic function).

Start by creating a python function called `quadratic`. It should
compute the quadratic basis, $$
\basisMatrix = \begin{bmatrix} \mathbf{1} & \inputVector & \inputVector^2\end{bmatrix}
$$ It should be called in the following form:

`Phi = quadratic(x)`

Use this to compute the quadratic fit for the model, again plotting the
result titled by the error.

*10 marks*

In [None]:
# Write code for your answer to Question 2 in this box
# provide the answers so that the code runs correctly otherwise you will lose marks!



## Polynomial Fits to Olympics Data

In [None]:
from ipywidgets import IntSlider

In [None]:
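# max_basis (the largest number of basis functions with a saved plot) must be defined before running this cell
pods.notebook.display_plots('olympic_LM_polynomial_number{num_basis:0>3}.svg', 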
                            directory='../slides/diagrams/ml', 
                            num_basis=IntSlider(1, 1, max_basis, 1))

<img src="http://inverseprobability.com/talks/slides/../slides/diagrams/ml/olympic_LM_polynomial_number026.svg" class="" align="80%" style="vertical-align:middle;">

Figure: <i>Polynomial fit to olympic data with 26 basis functions.</i>

## Hold Out Validation on Olympic Marathon Data

In [None]:
import pods
from ipywidgets import IntSlider

In [None]:
pods.notebook.display_plots('olympic_val_extra_LM_polynomial_number{num_basis:0>3}.svg', 
                            directory='../slides/diagrams/ml', 
                            num_basis=IntSlider(1, 1, max_basis, 1))

<img src="http://inverseprobability.com/talks/slides/../slides/diagrams/ml/olympic_val_extra_LM_polynomial_number011.svg" class="" align="80%" style="vertical-align:middle;">

Figure: <i>Olympic marathon data with validation error for
extrapolation.</i>

## Extrapolation

## Interpolation

In [None]:
import pods
from ipywidgets import IntSlider

In [None]:
pods.notebook.display_plots('olympic_val_inter_LM_polynomial_number{num_basis:0>3}.svg', 
                            directory='../slides/diagrams/ml', 
                            num_basis=IntSlider(1, 1, max_basis, 1))

<img src="http://inverseprobability.com/talks/slides/../slides/diagrams/ml/olympic_val_inter_LM_polynomial_number011.svg" class="" align="80%" style="vertical-align:middle;">

Figure: <i>Olympic marathon data with validation error for
interpolation.</i>

## Choice of Validation Set

## Hold Out Data

You have a conclusion as to which model fits best under the training
error, but how do the two models perform in terms of validation? In this
section we consider *hold out* validation. In hold out validation we
remove a portion of the training data for *validating* the model on. The
remaining data is used for fitting the model (training). Because this is
a time series prediction, it makes sense for us to hold out data at the
end of the time series. This means that we are validating on future
predictions. We will hold out data from after 1980 and fit the model to
the data up to and including 1980.

In [None]:
# select indices of data to 'hold out'
indices_hold_out = np.flatnonzero(x>1980)

# Create a training set
x_train = np.delete(x, indices_hold_out, axis=0)
y_train = np.delete(y, indices_hold_out, axis=0)

# Create a hold out set
x_valid = np.take(x, indices_hold_out, axis=0)
y_valid = np.take(y, indices_hold_out, axis=0)

### Question 3

For both the linear and quadratic models, fit the model to the data up
to and including 1980 and then compute the error on the held out data
(after 1980). Which model performs better on the validation data?

*10 marks*

In [None]:
# Write code for your answer to Question 3 in this box
# provide the answers so that the code runs correctly otherwise you will lose marks!



## Richer Basis Set

Now we have an approach for deciding which model to retain, we can
consider the entire family of polynomial bases, with arbitrary degrees.

### Question 4

Now we are going to build a more sophisticated form of basis function,
one that can accept additional arguments alongside its inputs (similar
to those we used in [this lab](./week4.ipynb)). Here we will start with
a polynomial basis.

In [None]:
def polynomial(x, degree, loc, scale):
    # Exponents 0, 1, ..., degree; broadcasting gives one column per power
    degrees = np.arange(degree+1)
    return ((x-loc)/scale)**degrees

The basis as we've defined it has three arguments as well as the input:
the degree of the polynomial, the location (offset) of the polynomial
and the scale. These arguments need to be passed to the basis function
whenever it is called. Modify your code to pass these additional
arguments to the python function for creating the basis. Do this for
each of your functions `prediction`, `fit` and `objective`. You will find
`*args` (or `**kwargs`) useful.
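
As a hedged illustration of the forwarding pattern, the sketch below passes extra arguments through to a basis function using `*args` and `**kwargs`. The helper `design_matrix` is hypothetical and is not one of the functions requested in the questions. The three input years are chosen only to make the output easy to read.

```python
import numpy as np


def polynomial(x, degree, loc, scale):
    """Polynomial basis as defined above."""
    degrees = np.arange(degree + 1)
    return ((x - loc) / scale) ** degrees


def design_matrix(x, basis, *args, **kwargs):
    """Hypothetical helper: forwards any extra arguments to the basis."""
    return basis(x, *args, **kwargs)


x = np.array([[1896.0], [1956.0], [2016.0]])
Phi = design_matrix(x, polynomial, 2, loc=1956.0, scale=120.0)
print(Phi.shape)
print(Phi)
```

Because `x` has one column and `degrees` is one dimensional, broadcasting produces a design matrix with one row per data point and one column per power, here of shape `(3, 3)`.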

Write code that tries to fit different models to the data with a
polynomial basis. Use a maximum degree for your basis from 0 to 17. For
each polynomial store the *hold out validation error* and the *training
error*. When you have finished the computation plot the hold out error
and the training error for your models. When computing your
polynomial basis use `loc=1956.` and `scale=120.` to ensure that the
data is mapped (roughly) to the -1, 1 range.

Which polynomial has the minimum training error? Which polynomial has
the minimum validation error?

*25 marks*

In [None]:
# Write code for your answer to Question 4 in this box
# provide the answers so that the code runs correctly otherwise you will lose marks!



## Leave One Out Validation

In [None]:
from ipywidgets import IntSlider
import pods

In [None]:
pods.notebook.display_plots('olympic_loo{part:0>3}_LM_polynomial_number{num_basis:0>3}.svg', 
                            directory='../slides/diagrams/ml', 
                            num_basis=IntSlider(1, 1, max_basis, 1), 
                            part=IntSlider(0, 0, x.shape[0], 1))

Hold out validation uses a portion of the data to hold out and a portion
of the data to train on. There is always a compromise between how much
data to hold out and how much data to train on. The more data you hold
out, the better the estimate of your performance at 'run-time' (when the
model is used to make predictions in real applications). However, by
holding out more data, you leave less data to train on, so you have a
better validation, but a poorer quality model fit than you could have
had if you'd used all the data for training. Leave one out cross
validation leaves as much data in the training phase as possible: you
only take *one point* out for your validation set. However, if you do
this for hold-out validation, then the quality of your validation error
is very poor because you are testing the model quality on one point
only. In *cross validation* the approach is to improve this estimate by
doing more than one model fit. In *leave one out cross validation* you
fit $\numData$ different models, where $\numData$ is the number of data
points. For each model fit you take out one data point, and train the
model on the remaining $\numData-1$ data points. You validate the model on the
data point you've held out, but you do this $\numData$ times, once for
each different model. You then take the *average* of all the $\numData$
badly estimated hold out validation errors. This average
is a good estimate of the performance of those models on the test data.
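The procedure described above can be sketched on synthetic data as follows. This is only an illustration under stated assumptions: it uses `np.polyfit` in place of the lab's own `fit` and `objective` functions, the helper name `loo_error` is hypothetical, and the squared error and the synthetic linear data are choices made for the example.

```python
import numpy as np

def loo_error(x, y, degree):
    """Mean squared leave one out error for a polynomial fit (hypothetical helper)."""
    n = x.shape[0]
    errors = np.zeros(n)
    for i in range(n):
        # Train on every point except point i.
        train = np.arange(n) != i
        coeffs = np.polyfit(x[train], y[train], degree)
        # Validate on the single held out point.
        pred = np.polyval(coeffs, x[i])
        errors[i] = (y[i] - pred) ** 2
    # Average the n single-point validation errors.
    return errors.mean()

# Synthetic data (an assumption for the example): a noisy straight line.
rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 20)
y = 1.0 + 2.0 * x + 0.1 * rng.standard_normal(20)
for degree in range(4):
    print(degree, loo_error(x, y, degree))
```

Each call performs one model fit per data point, which is why the cost grows quickly with the size of the data set.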

### Question 5

Write code that computes the *leave one out* validation error for the
olympic data and the polynomial basis. Use the functions you have
created above: `objective`, `fit`, `polynomial`. Compute the
*leave-one-out* cross validation error for basis functions containing a
maximum degree from 0 to 17.

*20 marks*

In [None]:
# Write code for your answer to Question 5 in this box
# provide the answers so that the code runs correctly otherwise you will lose marks!



## $k$-fold Cross Validation

In [None]:
from ipywidgets import IntSlider
import pods

In [None]:
pods.notebook.display_plots('olympic_{num_parts}'.format(num_parts=num_parts) + 'cv{part:0>2}_LM_polynomial_number{number:0>3}.svg', 
                            directory='../slides/diagrams/ml', 
                            part=IntSlider(0,0,5,1),
                            number=IntSlider(1, 1, max_basis, 1))

Leave one out cross validation produces a very good estimate of the
performance at test time, and is particularly useful if you don't have a
lot of data. In these cases you need to make as much use of your data
for model fitting as possible, and having a large hold out data set (to
validate model performance) can have a significant effect on the size of
the data set you have to fit your model, and correspondingly, the
complexity of the model you can fit. However, leave one out cross
validation involves fitting $\numData$ models, where $\numData$ is the
number of training data points. For the olympics example, this is only 27 model
fits, but in practice many data sets consist of thousands or millions of
data points, and fitting many millions of models for estimating
validation error isn't really practical. One option is to return to
*hold out* validation, but another approach is to perform $k$-fold cross
validation. In $k$-fold cross validation you split your data into $k$
parts. Then you use $k-1$ of those parts for training, and hold out one
part for validation, just like we did for the hold out validation above.
In *cross* validation, however, you repeat this process. You swap the
part of the data you just used for validation back in to the training
set and select another part for validation. You then fit the model to
the new training data and validate on the portion of data you've just
extracted. Each split of training/validation data is called a *fold* and
since you do this process $k$ times, the procedure is known as $k$-fold
cross validation. The term *cross* refers to the fact that you cross
over your validation portion back into the training data every time you
perform a fold.
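A minimal sketch of the fold construction and the repeated fit described above, again on synthetic data. The helper names `make_folds` and `k_fold_error` are hypothetical, `np.polyfit` stands in for the lab's own fitting code, and averaging the per-fold mean squared errors is one reasonable choice rather than the only one.

```python
import numpy as np

def make_folds(n, k, seed=0):
    """Randomly partition the indices 0..n-1 into k nearly equal parts."""
    rng = np.random.default_rng(seed)
    return np.array_split(rng.permutation(n), k)

def k_fold_error(x, y, degree, k=5, seed=0):
    """Average validation error over k folds (hypothetical helper)."""
    n = x.shape[0]
    errors = []
    for fold in make_folds(n, k, seed):
        # Everything outside the current fold is training data.
        train = np.ones(n, dtype=bool)
        train[fold] = False
        coeffs = np.polyfit(x[train], y[train], degree)
        pred = np.polyval(coeffs, x[fold])
        errors.append(np.mean((y[fold] - pred) ** 2))
    return np.mean(errors)

# With 27 points and k=5 the parts cannot all be the same size.
print([len(fold) for fold in make_folds(27, 5)])

# Synthetic data (an assumption for the example): a noisy straight line.
rng = np.random.default_rng(1)
x = np.linspace(-1, 1, 27)
y = 1.0 + 2.0 * x + 0.1 * rng.standard_normal(27)
for degree in range(4):
    print(degree, k_fold_error(x, y, degree))
```

Changing `seed` changes the random partition, so the estimated error (and possibly the selected model) can differ from run to run.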

### Question 6

Perform $k$-fold cross validation on the olympic data with your
polynomial basis. Use $k$ set to 5 (i.e. five fold cross validation). Do
the different forms of validation select different models? Does five
fold cross validation always select the same model?

*Note*: The data doesn't divide into 5 equal size partitions for the
five fold cross validation error. Don't worry about this too much. Two
of the partitions will have an extra data point. You might find
`np.random.permutation?` useful.

*20 marks*

In [None]:
# Write code for your answer to Question 6 in this box
# provide the answers so that the code runs correctly otherwise you will lose marks!



## Bias Variance Decomposition

The expected test error, averaged over different variations of the *training data*
sampled from $\Pr(\dataVector, \dataScalar)$, is
$$\mathbb{E}\left[ \left(\dataScalar - \mappingFunction^*(\dataVector)\right)^2 \right].$$
This decomposes as
$$\mathbb{E}\left[ \left(\dataScalar - \mappingFunction^*(\dataVector)\right)^2 \right] = \text{bias}\left[\mappingFunction^*(\dataVector)\right]^2 + \text{variance}\left[\mappingFunction^*(\dataVector)\right] +\sigma^2$$
where $\mappingFunction^*(\cdot)$ is the fitted model, $\mappingFunction(\cdot)$ is the true underlying function and $\sigma^2$ is the variance of the observation noise.

-   Bias is given by $$\text{bias}\left[\mappingFunction^*(\dataVector)\right] =
    \mathbb{E}\left[\mappingFunction^*(\dataVector)\right] - \mappingFunction(\dataVector)$$
-   Error due to bias comes from a model that's too simple.

-   Variance is given by
    $$\text{variance}\left[\mappingFunction^*(\dataVector)\right] = \mathbb{E}\left[\left(\mappingFunction^*(\dataVector) - \mathbb{E}\left[\mappingFunction^*(\dataVector)\right]\right)^2\right]$$
-   Slight variations in the training set cause changes in the
    prediction. Error due to variance comes from a model that's
    overly complex.
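The two terms can be estimated empirically by refitting a model to many resampled training sets. The sketch below is an illustration under assumptions that are not part of the lecture: the true function is taken to be $\sin(\pi x)$, the training inputs are fixed and only the noise is resampled, `np.polyfit` provides the model, and the helper name `bias_variance` is hypothetical.

```python
import numpy as np

def bias_variance(degree, num_sets=200, n=20, sigma=0.3, seed=0):
    """Estimate squared bias and variance of a polynomial fit (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    f = lambda x: np.sin(np.pi * x)      # assumed true function
    x_train = np.linspace(-1, 1, n)      # fixed training inputs
    x_test = np.linspace(-0.9, 0.9, 50)  # points where predictions are compared
    preds = np.zeros((num_sets, x_test.shape[0]))
    for s in range(num_sets):
        # A fresh noisy training set each time.
        y_train = f(x_train) + sigma * rng.standard_normal(n)
        coeffs = np.polyfit(x_train, y_train, degree)
        preds[s] = np.polyval(coeffs, x_test)
    mean_pred = preds.mean(axis=0)
    bias2 = np.mean((mean_pred - f(x_test)) ** 2)  # squared bias, averaged over x_test
    variance = np.mean(preds.var(axis=0))          # variance, averaged over x_test
    # Error against the noise-free function; adding sigma**2 gives the noisy test error.
    mse = np.mean((preds - f(x_test)) ** 2)
    return bias2, variance, mse

for degree in [1, 9]:
    bias2, variance, mse = bias_variance(degree)
    print(degree, bias2, variance, mse)
```

Under these assumptions the straight line should show the larger squared bias and the degree 9 polynomial the larger variance, and in both cases `bias2 + variance` matches `mse`.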

## Bias vs Variance Error Plots

Helper function for sampling data from two different classes.

In [None]:
import numpy as np

In [None]:
def create_data(per_cluster=30):
    """Create a randomly sampled data set
    
    :param per_cluster: number of points in each cluster
    """
    X = []
    y = []
    scale = 3
    prec = 1/(scale*scale)
    pos_mean = [[-1, 0],[0,0.5],[1,0]]
    pos_cov = [[prec, 0.], [0., prec]]
    neg_mean = [[0, -0.5],[0,-0.5],[0,-0.5]]
    neg_cov = [[prec, 0.], [0., prec]]
    for mean in pos_mean:
        X.append(np.random.multivariate_normal(mean=mean, cov=pos_cov, size=per_cluster))
        y.append(np.ones((per_cluster, 1)))
    for mean in neg_mean:
        X.append(np.random.multivariate_normal(mean=mean, cov=neg_cov, size=per_cluster))
        y.append(np.zeros((per_cluster, 1)))
    return np.vstack(X), np.vstack(y).flatten()

Helper function for plotting the decision boundary of the SVM.

In [None]:
def plot_contours(ax, cl, xx, yy, **params):
    """Plot the decision boundaries for a classifier.

    :param ax: matplotlib axes object
    :param cl: a classifier
    :param xx: meshgrid ndarray
    :param yy: meshgrid ndarray
    :param params: extra keyword arguments (accepted but currently unused), optional
    """
    Z = cl.decision_function(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    # Plot decision boundary and regions
    out = ax.contour(xx, yy, Z, 
                     levels=[-1., 0., 1], 
                     colors='black', 
                     linestyles=['dashed', 'solid', 'dashed'])
    out = ax.contourf(xx, yy, Z, 
                     levels=[Z.min(), 0, Z.max()], 
                     colors=[[0.5, 1.0, 0.5], [1.0, 0.5, 0.5]])
    return out

In [None]:
import mlai
import os

In [None]:
def decision_boundary_plot(models, X, y, axs, filename, titles, xlim, ylim):
    """Plot a decision boundary on the given axes
    
    :param axs: the axes to plot on.
    :param models: the SVM models to plot
    :param titles: the titles for each axis
    :param X: input training data
    :param y: target training data"""
    for ax in axs.flatten():
        ax.clear()
    X0, X1 = X[:, 0], X[:, 1]
    if xlim is None:
        xlim = [X0.min()-1, X0.max()+1]
    if ylim is None:
        ylim = [X1.min()-1, X1.max()+1]
    xx, yy = np.meshgrid(np.arange(xlim[0], xlim[1], 0.02),
                         np.arange(ylim[0], ylim[1], 0.02))
    for cl, title, ax in zip(models, titles, axs.flatten()):
        plot_contours(ax, cl, xx, yy,
                      cmap=plt.cm.coolwarm, alpha=0.8)
        ax.plot(X0[y==1], X1[y==1], 'r.', markersize=10)
        ax.plot(X0[y==0], X1[y==0], 'g.', markersize=10)
        ax.set_xlim(xlim)
        ax.set_ylim(ylim)
        ax.set_xticks(())
        ax.set_yticks(())
        ax.set_title(title)
    # Write the figure once, after every axis has been drawn (uses the global `fig`).
    mlai.write_figure(os.path.join(filename),
                      figure=fig,
                      transparent=True)
    return xlim, ylim

In [None]:
import matplotlib
font = {'family' : 'sans',
        'weight' : 'bold',
        'size'   : 22}

matplotlib.rc('font', **font)
import matplotlib.pyplot as plt

In [None]:
from sklearn import svm

# Create an instance of SVM and fit the data. 
C = 100.0  # SVM regularization parameter
gammas = [0.001, 0.01, 0.1, 1]


per_class=30
num_samps = 20
# Set-up 1x4 grid for plotting.
fig, ax = plt.subplots(1, 4, figsize=(10,3))
xlim=None
ylim=None
for samp in range(num_samps):
    X, y=create_data(per_class)
    models = []
    titles = []
    for gamma in gammas:
        models.append(svm.SVC(kernel='rbf', gamma=gamma, C=C))
        titles.append(r'$\gamma={}$'.format(gamma))
    models = (cl.fit(X, y) for cl in models)
    xlim, ylim = decision_boundary_plot(models, X, y, 
                           axs=ax, 
                           filename='../slides/diagrams/ml/bias-variance{samp:0>3}.svg'.format(samp=samp), 
                           titles=titles,
                          xlim=xlim,
                          ylim=ylim)

In [None]:
import pods
from ipywidgets import IntSlider

In [None]:
pods.notebook.display_plots('bias-variance{samp:0>3}.svg', 
                            directory='../slides/diagrams/ml', 
                            samp=IntSlider(0,0,10,1))

<img class="" src="http://inverseprobability.com/talks/slides/../slides/diagrams/ml/bias-variance000.png" style="width:80%"><img class="" src="http://inverseprobability.com/talks/slides/../slides/diagrams/ml/bias-variance010.png" style="width:80%">

Figure: <i>In each figure the simpler model is on the left, and the
more complex model is on the right. Each fit is done to a different
version of the data set. The simpler model is more consistent in its
errors (bias error), whereas the more complex model is varying in its
errors (variance error).</i>

Further reading: Section 1.5 of @Rogers:book11.
# References