Generalization and Neural Networks
==================================

### [Neil D. Lawrence](http://inverseprobability.com)

### 2021-01-26

**Abstract**: This lecture will cover generalization in machine learning
with a particular focus on neural architectures. We will review
classical generalization and explore what’s different about neural
network models.

<!-- Do not edit this file locally. -->

Setup
-----

First we download some libraries and files to support the notebook.

In [None]:
import urllib.request

In [None]:
urllib.request.urlretrieve('https://raw.githubusercontent.com/lawrennd/talks/gh-pages/ndlml.py','ndlml.py')

In [None]:
urllib.request.urlretrieve('https://raw.githubusercontent.com/lawrennd/talks/gh-pages/teaching_plots.py','teaching_plots.py')

In [None]:
urllib.request.urlretrieve('https://raw.githubusercontent.com/lawrennd/talks/gh-pages/gp_tutorial.py','gp_tutorial.py')

In [None]:
import matplotlib.pyplot as plt
plt.rcParams.update({'font.size': 22})

<!--setupplotcode{import seaborn as sns
sns.set_style('darkgrid')
sns.set_context('paper')
sns.set_palette('colorblind')}-->

pods
----

In Sheffield we created a suite of software tools for ‘Open Data
Science’. Open data science is an approach to sharing code, models and
data that should make it easier for companies, health professionals and
scientists to gain access to data science techniques.

You can also check this blog post on [Open Data
Science](http://inverseprobability.com/2014/07/01/open-data-science).

The software can be installed using

In [None]:
%pip install --upgrade git+https://github.com/sods/ods

from the command prompt where you can access your python installation.

The code is also available on github:
<a href="https://github.com/sods/ods" class="uri">https://github.com/sods/ods</a>

Once `pods` is installed, it can be imported in the usual manner.

In [None]:
import pods

Bias Variance Decomposition
---------------------------

The bias-variance decomposition considers the expected test error for
different variations of the *training data* sampled from
$\Pr(\mathbf{ x}, y)$, $$
\mathbb{E}\left[ \left(y- f^*(\mathbf{ x})\right)^2 \right],
$$ where $f^*(\mathbf{ x})$ is the prediction of the fitted model and
$f(\mathbf{ x})$ is the underlying function that generated the data.
This can be decomposed into three terms, $$
\mathbb{E}\left[ \left(y- f^*(\mathbf{ x})\right)^2 \right] = \text{bias}\left[f^*(\mathbf{ x})\right]^2 + \text{variance}\left[f^*(\mathbf{ x})\right] +\sigma^2,
$$ where $\sigma^2$ is the variance of the observation noise and the
bias is given by $$
  \text{bias}\left[f^*(\mathbf{ x})\right] =
\mathbb{E}\left[f^*(\mathbf{ x})\right] - f(\mathbf{ x})
$$ and it summarizes error that arises from the model’s inability to
represent the underlying complexity of the data. For example, if we were
to model the marathon pace of the winning runner from the Olympics by
computing the average pace across time, then that model would exhibit
*bias* error because in reality the Olympic marathon pace is changing
over time (typically getting faster).

The variance term is given by $$
  \text{variance}\left[f^*(\mathbf{ x})\right] = \mathbb{E}\left[\left(f^*(\mathbf{ x}) - \mathbb{E}\left[f^*(\mathbf{ x})\right]\right)^2\right].
  $$ The variance term is often described as arising from a model that
is too complex, but we have to be careful with this idea. Is the model
really too complex relative to the real world that generates the data?
The real world is a complex place, and it is rare that we are
constructing mathematical models that are more complex than the world
around us. Rather, ‘too complex’ refers to our ability to estimate the
parameters of the model given the data we have: slight variations in the
training set cause changes in prediction.

Models that exhibit high variance are sometimes said to ‘overfit’ the
data whereas models that exhibit high bias are sometimes described as
‘underfitting’ the data.
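The decomposition can be checked numerically. The following is a minimal sketch, not part of the original notebook. It assumes a synthetic one-dimensional problem in which the true function is $\sin(\pi x)$ with Gaussian observation noise, and it fits polynomials of different degrees to many independently sampled training sets. The function name `bias_variance` and all of its parameters are illustrative assumptions.

```python
import numpy as np

def bias_variance(degree, num_sets=200, n=20, noise_std=0.3, seed=0):
    """Estimate squared bias and variance of a polynomial fit
    by refitting on many independently sampled training sets."""
    rng = np.random.default_rng(seed)
    x_test = np.linspace(-1, 1, 50)
    f_true = np.sin(np.pi * x_test)  # assumed true function f(x)
    preds = np.empty((num_sets, x_test.size))
    for i in range(num_sets):
        # Sample a fresh training set from the same distribution.
        x = rng.uniform(-1, 1, n)
        y = np.sin(np.pi * x) + noise_std * rng.standard_normal(n)
        coeffs = np.polyfit(x, y, degree)
        preds[i] = np.polyval(coeffs, x_test)  # f^*(x) for this training set
    mean_pred = preds.mean(axis=0)             # estimate of E[f^*(x)]
    bias_sq = np.mean((mean_pred - f_true) ** 2)
    variance = np.mean(preds.var(axis=0))
    return bias_sq, variance

for degree in [0, 1, 3, 9]:
    b, v = bias_variance(degree)
    print(f"degree {degree}: bias^2 = {b:.3f}, variance = {v:.3f}")
```

In this toy setting the constant model (degree 0) is dominated by bias, while the degree 9 polynomial is dominated by variance, because its predictions change substantially from one training set to the next.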

Bias vs Variance Error Plots
----------------------------

Helper function for sampling data from two different classes.

In [None]:
import numpy as np

In [None]:
def create_data(per_cluster=30):
    """Create a randomly sampled data set
    
    :param per_cluster: number of points in each cluster
    """
    X = []
    y = []
    scale = 3
    prec = 1/(scale*scale)
    pos_mean = [[-1, 0],[0,0.5],[1,0]]
    pos_cov = [[prec, 0.], [0., prec]]
    neg_mean = [[0, -0.5],[0,-0.5],[0,-0.5]]
    neg_cov = [[prec, 0.], [0., prec]]
    for mean in pos_mean:
        X.append(np.random.multivariate_normal(mean=mean, cov=pos_cov, size=per_cluster))
        y.append(np.ones((per_cluster, 1)))
    for mean in neg_mean:
        X.append(np.random.multivariate_normal(mean=mean, cov=neg_cov, size=per_cluster))
        y.append(np.zeros((per_cluster, 1)))
    return np.vstack(X), np.vstack(y).flatten()

Helper function for plotting the decision boundary of the SVM.

In [None]:
def plot_contours(ax, cl, xx, yy, **params):
    """Plot the decision boundaries for a classifier.

    :param ax: matplotlib axes object
    :param cl: a classifier
    :param xx: meshgrid ndarray
    :param yy: meshgrid ndarray
    :param params: dictionary of params to pass to contourf, optional
    """
    Z = cl.decision_function(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    # Plot decision boundary and regions
    out = ax.contour(xx, yy, Z, 
                     levels=[-1., 0., 1], 
                     colors='black', 
                     linestyles=['dashed', 'solid', 'dashed'])
    out = ax.contourf(xx, yy, Z, 
                     levels=[Z.min(), 0, Z.max()], 
                     colors=[[0.5, 1.0, 0.5], [1.0, 0.5, 0.5]])
    return out

In [None]:
urllib.request.urlretrieve('https://raw.githubusercontent.com/lawrennd/talks/gh-pages/mlai.py','mlai.py')

In [None]:
import mlai
import os

In [None]:
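def decision_boundary_plot(models, X, y, axs, filename, directory, titles, xlim, ylim):
    """Plot a decision boundary on the given axes
    
    :param models: the SVM models to plot
    :param X: input training data
    :param y: target training data
    :param axs: the axes to plot on.
    :param filename: name of the file the figure is written to
    :param directory: directory the figure is written to
    :param titles: the titles for each axis
    :param xlim: x-axis limits, computed from X if None
    :param ylim: y-axis limits, computed from X if None
    :returns: the x-axis and y-axis limits used"""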
    for ax in axs.flatten():
        ax.clear()
    X0, X1 = X[:, 0], X[:, 1]
    if xlim is None:
        xlim = [X0.min()-1, X0.max()+1]
    if ylim is None:
        ylim = [X1.min()-1, X1.max()+1]
    xx, yy = np.meshgrid(np.arange(xlim[0], xlim[1], 0.02),
                         np.arange(ylim[0], ylim[1], 0.02))
    for cl, title, ax in zip(models, titles, axs.flatten()):
        plot_contours(ax, cl, xx, yy,
                      cmap=plt.cm.coolwarm, alpha=0.8)
        ax.plot(X0[y==1], X1[y==1], 'r.', markersize=10)
        ax.plot(X0[y==0], X1[y==0], 'g.', markersize=10)
        ax.set_xlim(xlim)
        ax.set_ylim(ylim)
        ax.set_xticks(())
        ax.set_yticks(())
        ax.set_title(title)
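    # Save the figure once, after every panel has been drawn. The figure
    # is taken from the axes rather than from a global variable.
    mlai.write_figure(filename,
                      directory=directory,
                      figure=axs.flatten()[0].get_figure(),
                      transparent=True)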
    return xlim, ylim

In [None]:
import matplotlib
font = {'family' : 'sans',
        'weight' : 'bold',
        'size'   : 22}

matplotlib.rc('font', **font)
import matplotlib.pyplot as plt

In [None]:
from sklearn import svm

In [None]:
# Create an instance of SVM and fit the data. 
C = 100.0  # SVM regularization parameter
gammas = [0.001, 0.01, 0.1, 1]


per_class=30
num_samps = 20
# Set up a 1x4 grid for plotting, one panel per value of gamma.
fig, ax = plt.subplots(1, 4, figsize=(10,3))
xlim=None
ylim=None
for samp in range(num_samps):
    X, y=create_data(per_class)
    models = []
    titles = []
    for gamma in gammas:
        models.append(svm.SVC(kernel='rbf', gamma=gamma, C=C))
        titles.append(r'$\gamma={}$'.format(gamma))
    models = (cl.fit(X, y) for cl in models)
    xlim, ylim = decision_boundary_plot(models, X, y,
                           axs=ax,
                           filename='bias-variance{samp:0>3}.svg'.format(samp=samp),
                           directory='./ml',
                           titles=titles,
                           xlim=xlim,
                           ylim=ylim)

In [None]:
import pods
from ipywidgets import IntSlider

In [None]:
pods.notebook.display_plots('bias-variance{samp:0>3}.svg', 
                            directory='./ml', 
                            samp=IntSlider(0,0,10,1))

<img class="" src="http://inverseprobability.com/talks/slides/../slides/diagrams/ml/bias-variance000.png" style="width:80%"><img class="" src="http://inverseprobability.com/talks/slides/../slides/diagrams/ml/bias-variance010.png" style="width:80%">

Figure: <i>In each figure the simpler model is on the left, and the more
complex model is on the right. Each fit is done to a different version
of the data set. The simpler model is more consistent in its errors
(bias error), whereas the more complex model is varying in its errors
(variance error).</i>

Bias variance dilemma
<a href="https://www.mitpressjournals.org/doi/abs/10.1162/neco.1992.4.1.1" class="uri">https://www.mitpressjournals.org/doi/abs/10.1162/neco.1992.4.1.1</a>

bootstrap

Bootstrap Prediction and Bayesian Misspecified Models:
<a href="https://www.jstor.org/stable/3318894#metadata_info_tab_contents" class="uri">https://www.jstor.org/stable/3318894#metadata_info_tab_contents</a>

Edwin Fong and Chris Holmes: On the Marginal Likelihood and Cross
Validation
<a href="https://arxiv.org/abs/1905.08737" class="uri">https://arxiv.org/abs/1905.08737</a>

The lack of a priori distinction between learning algorithms (No free
lunch)
<a href="https://www.mitpressjournals.org/doi/abs/10.1162/neco.1996.8.7.1341" class="uri">https://www.mitpressjournals.org/doi/abs/10.1162/neco.1996.8.7.1341</a>
<a href="https://link.springer.com/chapter/10.1007/978-1-4471-0123-9_3" class="uri">https://link.springer.com/chapter/10.1007/978-1-4471-0123-9_3</a>

David Hogg’s lecture
<a href="https://speakerdeck.com/dwhgg/linear-regression-with-huge-numbers-of-parameters" class="uri">https://speakerdeck.com/dwhgg/linear-regression-with-huge-numbers-of-parameters</a>

Belkin on Bias/Variance
<a href="https://www.pnas.org/content/116/32/15849.short" class="uri">https://www.pnas.org/content/116/32/15849.short</a>
<a href="https://www.pnas.org/content/117/20/10625" class="uri">https://www.pnas.org/content/117/20/10625</a>

Belkin Talk:
<a href="http://www.ipam.ucla.edu/abstract/?tid=15552&amp;pcode=GLWS4" class="uri">http://www.ipam.ucla.edu/abstract/?tid=15552&amp;pcode=GLWS4</a>

The Deep Bootstrap
<a href="https://twitter.com/PreetumNakkiran/status/1318007088321335297?s=20" class="uri">https://twitter.com/PreetumNakkiran/status/1318007088321335297?s=20</a>

Aki Vehtari on Leave One Out Uncertainty:
<a href="https://arxiv.org/abs/2008.10296" class="uri">https://arxiv.org/abs/2008.10296</a>
(check for his references).

Large models and memorisation:
<a href="https://arxiv.org/abs/2012.07805" class="uri">https://arxiv.org/abs/2012.07805</a>

Double Descent
--------------

One of Breiman’s ideas for improving predictive performance is known as
bagging (Breiman, 1996). The idea is to train a number of models on the
data such that they overfit (high variance), then average the
predictions of these models. The models are trained on different
bootstrap samples (Efron, 1979) and their predictions are aggregated,
giving us the name bagging (bootstrap aggregating). By combining decision trees with
bagging, we recover random forests (Breiman, 2001).
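The bagging recipe can be sketched in a few lines. This is an illustrative example rather than Breiman's implementation: it assumes a synthetic regression problem and uses flexible polynomial fits as the high variance base models in place of decision trees. The function name `bagged_predict` and its parameters are hypothetical.

```python
import numpy as np

def bagged_predict(x, y, x_new, degree=7, num_models=50, seed=0):
    """Fit one polynomial per bootstrap sample and average the predictions."""
    rng = np.random.default_rng(seed)
    n = x.size
    preds = np.empty((num_models, x_new.size))
    for m in range(num_models):
        # Bootstrap sample: draw n indices with replacement.
        idx = rng.integers(0, n, size=n)
        coeffs = np.polyfit(x[idx], y[idx], degree)
        preds[m] = np.polyval(coeffs, x_new)
    # Aggregate by averaging the individual predictions.
    return preds.mean(axis=0), preds

# Synthetic data (an assumption for illustration only).
rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, 40)
y = np.sin(np.pi * x) + 0.3 * rng.standard_normal(40)
x_new = np.linspace(-0.9, 0.9, 100)
f_new = np.sin(np.pi * x_new)

bagged, individual = bagged_predict(x, y, x_new)
mse_bagged = np.mean((bagged - f_new) ** 2)
mse_individual = np.mean((individual - f_new) ** 2)
print(f"average error of single models: {mse_individual:.4f}")
print(f"error of bagged prediction:     {mse_bagged:.4f}")
```

For squared error, the error of the averaged prediction can never exceed the average error of the individual models (a consequence of Jensen's inequality), which is the sense in which averaging reduces the variance component.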

Bias and variance can also be estimated through Efron’s bootstrap
(Efron, 1979), and the traditional view has been that there’s a form of
Goldilocks effect, where the best predictions are given by the model
that is ‘just right’ for the amount of data available. Not too simple,
not too complex. The idea is that bias decreases with increasing model
complexity and variance increases with increasing model complexity.
Typically plots begin with the Mummy bear on the left (too much bias),
end with the Daddy bear on the right (too much variance) and show a dip
in the middle where the Baby bear (just right) finds themselves.

The Daddy bear is typically positioned at the point where the model is
able to exactly interpolate the data. For a generalized linear model
(McCullagh and Nelder, 1989), this is the point at which the number of
parameters is equal to the number of data[1]. But the modern empirical
finding is that when we move beyond Daddy bear, into the dark forest of
the massively overparameterized model we can achieve good
generalization.

As Zhang et al. (2017) starkly illustrated with their random labels
experiment, within the dark forest there are some terrible places, big
bad wolves of overfitting that will gobble up your model. But as
empirical evidence shows there is also a safe and hospitable Grandma’s
house where these highly overparameterized models are safely consumed.
Fundamentally, it must be about the route you take through the forest,
and the precautions you take to ensure the wolf doesn’t see where you’re
going and beat you to the door.
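A linear analogue of the random labels experiment gives a feel for the wolf. The sketch below is not the experiment of Zhang et al. (2017), which used deep networks on image data. It assumes a linear model with many more parameters than data points, Gaussian random inputs, and labels that are pure coin flips, so there is nothing to learn.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 500  # far more parameters than data points

# Inputs and labels are independent: the labels are coin flips.
X_train = rng.standard_normal((n, p))
y_train = rng.choice([-1.0, 1.0], size=n)
X_test = rng.standard_normal((n, p))
y_test = rng.choice([-1.0, 1.0], size=n)

# Minimum-norm solution that interpolates the training labels.
w = np.linalg.pinv(X_train) @ y_train

train_acc = np.mean(np.sign(X_train @ w) == y_train)
test_acc = np.mean(np.sign(X_test @ w) == y_test)
print(f"training accuracy: {train_acc:.2f}")
print(f"test accuracy:     {test_acc:.2f}")
```

The model reaches 100% training accuracy on labels that carry no information, so training accuracy alone cannot distinguish Grandma's house from the wolf's lair. Test accuracy in this setting can only be at chance level on average.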

There are two implications of this empirical result. Firstly, that there
is a great deal of new theory that needs to be developed. Secondly, that
theory is now obliged to conflate two aspects to modelling that we
generally like to keep separate: the model and the algorithm.

Classical statistical theory around predictive generalization focusses
specifically on the class of models that is being used for data fitting.
Historically, whether that theory follows a Fisher-aligned estimation
approach (see e.g. Vapnik (1998)) or model-based Bayesian approach (see
e.g. Ghahramani (2015)), neither is fully equipped to deal with these
new circumstances because, to continue our rather tortured analogy,
these theories provide us with a characterization of the *destination*
of the algorithm, and seek to ensure that we reach that destination.
Modern machine learning requires theories of the *journey* and what our
route through the forest should be.

Crucially, the destination is always associated with 100% accuracy on
the training set. An objective that is always achievable for the
overparameterized model.

Intuitively, it seems that a highly overparameterized model places
Grandma’s house on the edge of the dark forest. Making it easily and
quickly accessible to the algorithm. The larger the model, the more
exposed Grandma’s house becomes. Perhaps this is due to some form of
blessing of dimensionality that brings Grandma’s house closer to the
edge of the forest in a high dimensional setting. Really, we should think of
Grandma’s house as a low dimensional manifold of destinations that are
safe. A path through the forest where the wolf of overfitting doesn’t
venture. In the GLM case, we know already that when the number of
parameters matches the number of data there is precisely one location in
parameter space where accuracy on the training data is 100%. Our
previous misunderstanding of generalization stemmed from the fact that
(seemingly) it is highly unlikely that this single point is a good place
to be from the perspective of generalization. Additionally, it is often
difficult to find. Finding the precise polynomial coefficients in a
least squares regression to exactly fit the basis to a small data set
such as the Olympic marathon data requires careful consideration of the
numerical properties and an orthogonalization of the design matrix
(Lawson and Hanson, 1995).
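The numerical difficulty is easy to see in a small example. The sketch below assumes twelve evenly spaced hypothetical years on the scale of the Olympic data and synthetic targets (not the actual marathon data). It compares the conditioning of the raw monomial design matrix with an orthogonalized one, obtained here by rescaling the inputs, switching to a Legendre basis, and taking a QR decomposition.

```python
import numpy as np

n = 12
x = np.linspace(1896, 2012, n)  # hypothetical years, for scale only
rng = np.random.default_rng(0)
y = rng.standard_normal(n)      # synthetic targets, not the Olympic data

# Raw monomial basis: n parameters for n data points.
Phi = np.vander(x, n, increasing=True)
cond_raw = np.linalg.cond(Phi)

# Rescale the inputs to [-1, 1] and use a Legendre polynomial basis.
z = 2 * (x - x.min()) / (x.max() - x.min()) - 1
Phi_leg = np.polynomial.legendre.legvander(z, n - 1)
cond_leg = np.linalg.cond(Phi_leg)

# Orthogonalize the design matrix: Phi_leg = Q R with orthonormal columns in Q.
Q, R = np.linalg.qr(Phi_leg)
w = np.linalg.solve(R, Q.T @ y)  # the unique interpolating coefficients

print(f"condition number, raw monomials: {cond_raw:.3g}")
print(f"condition number, Legendre:      {cond_leg:.3g}")
print(f"condition number, Q:             {np.linalg.cond(Q):.3g}")
print(f"max interpolation error:         {np.max(np.abs(Phi_leg @ w - y)):.3g}")
```

With the raw years as inputs the design matrix is numerically singular in double precision, whereas the orthogonalized system can be solved to recover the single set of coefficients that fits the data exactly.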

It seems that with a highly overparameterized model, these locations
become easier to find and they provide good generalization properties.
In machine learning this is known as the “double descent phenomenon”
(see e.g. Belkin et al. (2019)).

[1] Assuming we are ignoring parameters in the link function and the
distribution function.

Thanks!
-------

For more information on these subjects and more you might want to check
the following resources.

-   twitter: [@lawrennd](https://twitter.com/lawrennd)
-   podcast: [The Talking Machines](http://thetalkingmachines.com)
-   newspaper: [Guardian Profile
    Page](http://www.theguardian.com/profile/neil-lawrence)
-   blog:
    [http://inverseprobability.com](http://inverseprobability.com/blog.html)

References
----------

Belkin, M., Hsu, D., Ma, S., Mandal, S., 2019. Reconciling modern
machine-learning practice and the classical bias-variance trade-off.
Proc. Natl. Acad. Sci. USA 116, 15849–15854.

Breiman, L., 2001. Random forests. Mach. Learn. 45, 5–32.
<https://doi.org/10.1023/A:1010933404324>

Breiman, L., 1996. Bagging predictors. Machine Learning 24, 123–140.
<https://doi.org/10.1007/BF00058655>

Efron, B., 1979. Bootstrap methods: Another look at the jackknife. Annals
of Statistics 7, 1–26.

Ghahramani, Z., 2015. Probabilistic machine learning and artificial
intelligence. Nature 521, 452–459.

Lawson, C.L., Hanson, R.J., 1995. Solving least squares problems. SIAM.
<https://doi.org/10.1137/1.9781611971217>

McCullagh, P., Nelder, J.A., 1989. Generalized linear models, 2nd ed.
Chapman and Hall.

Vapnik, V.N., 1998. Statistical learning theory. Wiley, New York.

Zhang, C., Bengio, S., Hardt, M., Recht, B., Vinyals, O., 2017.
Understanding deep learning requires rethinking generalization, in:
International Conference on Learning Representations.
<https://openreview.net/forum?id=Sy8gdB9xx>