# Generalization in Deep Learning

In previous chapters, we examined algorithms for maximising the likelihood of the observed training labels in regression and classification tasks. However, we care about maximising the likelihood of the training labels only insofar as it allows us to discover general patterns in the underlying data. It is serendipitous that machine learning models trained with stochastic gradient descent generalize rather well across numerous problems. Unfortunately, there is no rigorous statistical explanation for why precisely this is the case, and our understanding of both the theory and practice of machine learning is evolving rapidly.

In general, the current theory of deep learning has opened up some promising avenues and produced a few interesting results, but nothing like a general overarching theory. Lacking are explanations of:
1. Why we are able to optimize neural networks at all.
2. How models learned by gradient descent generalize so well, even on high-dimensional tasks.

However, (1) is rarely a problem in practice, as we can almost always find parameters that fit all of our training data; understanding generalization is a much bigger problem. While there is no grand theory for understanding this, there is a body of heuristics and techniques which can be applied to tackle it.

### Overfitting and Regularization

Given only a finite training set, a model must rely on certain assumptions in order to generalize. These assumptions, known as _inductive biases_, give models a preference for solutions with certain properties. For example, a deep MLP has an inductive bias towards building up a complicated function as a composition of simpler ones.

With machine learning models encoding inductive biases, our approach to training them consists of two phases: (1) fit the training data, and (2) evaluate the performance of the model on a test dataset withheld from training; that is to say, estimate the _generalization_ error. The difference between the fit on the training data and the fit on the test data is called the generalization gap. When the generalization gap is large, we typically say that our model has overfit. Classically, we rectify overfitting by reducing the number of learned parameters in our model (i.e. by reducing model complexity).
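
As a minimal sketch of the two phases above, the snippet below measures the generalization gap of a deliberately overfit model. Everything in it is an illustrative assumption rather than part of the original material: the toy task (`make_data`), the 20% label noise, and the lookup-table "memoriser" that stands in for a model flexible enough to fit every training example.

```python
import random

def error_rate(predict, examples):
    """Fraction of (x, y) pairs that `predict` gets wrong."""
    wrong = sum(1 for x, y in examples if predict(x) != y)
    return wrong / len(examples)

def make_data(n, rng):
    # Hypothetical toy task: the label is 1 when x > 0.5,
    # and 20% of the labels are flipped at random.
    data = []
    for _ in range(n):
        x = rng.random()
        y = int(x > 0.5)
        if rng.random() < 0.2:
            y = 1 - y
        data.append((x, y))
    return data

rng = random.Random(0)
train, test = make_data(200, rng), make_data(200, rng)

# Phase 1: "fit" the training data by memorising every pair.
# Inputs that were never seen are assigned class 0.
table = dict(train)

def memoriser(x):
    return table.get(x, 0)

# Phase 2: compare the fit on training data with the fit on held-out data.
train_error = error_rate(memoriser, train)
test_error = error_rate(memoriser, test)
generalization_gap = test_error - train_error
print(train_error, test_error, generalization_gap)
```

The memoriser makes no mistakes on the training set but is no better than a constant guess on the test set, so the gap is large: this is overfitting in its most extreme form.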

What is unusual about deep learning is that, for classification problems say, our models are often expressive enough to perfectly fit every training example. It might naively be assumed that this is an extreme of model complexity, and that any improvement in generalization must come from regularization, either by reducing model complexity or by imposing strict penalties on the values which our parameters can adopt. However, this turns out not to be the case.

When training models for image recognition and text classification, we are often choosing among model architectures which can achieve arbitrarily low loss (and zero training error), meaning that any improvements must come from the reduction of overfitting. What is particularly bizarre is that even though these models can fit the training data perfectly, we can often _reduce_ the generalization error by making the model even more expressive, for example by adding layers or parameters, or by training for more epochs. This behaviour can even be non-monotonic, with more model flexibility degrading model performance at first and then improving it again, in a so-called "double descent" pattern.

### Inspiration from Nonparametrics

In some sense, it can be useful to think of neural networks as non-parametric models, although strictly speaking they are not, since a network has a fixed set of parameters. One common theme among non-parametric models is that model complexity grows with the amount of available data (as in kernel methods). Non-parametric models often simply interpolate through all of their training data points, and yet still generalize quite well. It has even been shown that in some situations, such as the limit of infinite width, kernel methods and neural networks with only one hidden layer are equivalent to one another.
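
To make the interpolation point concrete, here is a small sketch using a 1-nearest-neighbour regressor. It is a simple non-parametric model chosen for illustration; the notes above do not name it. The noise-free `math.sin` target and the sample sizes are also assumptions. The model stores its entire training set, so its complexity grows with the data, and it reproduces every training target exactly.

```python
import math
import random

def nearest_neighbour(train):
    """Return a predictor that copies the target of the closest training input."""
    def predict(x):
        _, y = min(train, key=lambda pair: abs(pair[0] - x))
        return y
    return predict

def mean_squared_error(predict, examples):
    return sum((predict(x) - y) ** 2 for x, y in examples) / len(examples)

rng = random.Random(0)
target = math.sin  # hypothetical noise-free target function

def sample(n):
    xs = [rng.uniform(0.0, math.pi) for _ in range(n)]
    return [(x, target(x)) for x in xs]

test = sample(500)
results = {}
for n in (10, 100, 1000):
    train = sample(n)
    model = nearest_neighbour(train)
    # (training error, test error) for a model that memorises n points
    results[n] = (mean_squared_error(model, train),
                  mean_squared_error(model, test))
    print(n, results[n])
```

The training error is zero at every size because the model interpolates its data, yet the test error shrinks as more points are stored. Interpolation by itself therefore does not rule out good generalization.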

### Early Stopping

While deep learning models can fit arbitrary data, even randomly labelled data, this behaviour only emerges after many iterations of training. When high-quality data is mixed with poor-quality (e.g. randomly labelled) data, deep learning models will generally fit the high-quality data first, and only later the randomly labelled data. It can be shown that when a model has fitted the high-quality data but not the low-quality data, it has in fact generalized.

This fact motivates the _early stopping_ method, where instead of directly constraining the values that the parameters of a model may take, one limits the number of epochs for which the model is trained. In general, this is done by monitoring the loss on the validation set after each epoch and halting training once it has stopped decreasing by more than some threshold.
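
The monitoring loop described above might look like the following sketch. The function name, the `patience` parameter (how many epochs without improvement to tolerate before halting) and the two callables are assumptions made for illustration; they are not tied to any particular framework.

```python
def train_with_early_stopping(train_one_epoch, validation_loss,
                              max_epochs=100, patience=5, min_delta=1e-4):
    """Halt once validation loss has failed to improve by `min_delta`
    for `patience` consecutive epochs.

    `train_one_epoch` and `validation_loss` are hypothetical callables
    supplied by the caller.
    """
    best_loss = float("inf")
    best_epoch = -1
    epochs_without_improvement = 0
    for epoch in range(max_epochs):
        train_one_epoch()
        loss = validation_loss()
        if loss < best_loss - min_delta:
            # Meaningful improvement: remember it and reset the counter.
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break
    return best_epoch, best_loss

# Usage with a made-up validation curve that falls and then rises again.
curve = iter([1.0, 0.6, 0.4, 0.35, 0.36, 0.38, 0.41, 0.45, 0.50, 0.56])
best_epoch, best_loss = train_with_early_stopping(
    train_one_epoch=lambda: None,
    validation_loss=lambda: next(curve),
    patience=3,
)
print(best_epoch, best_loss)  # 3 0.35
```

In a real training run one would typically also save the model parameters at `best_epoch` and restore them after stopping, since the final epochs have already drifted past the best validation loss.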

Generally, where classes or categories can be easily and unambiguously distinguished from one another (e.g. distinguishing between cats and dogs), early stopping offers little benefit, but when the data is noisy, or there is intrinsic uncertainty over the categorisation, early stopping is essential. Think about it: training models until they interpolate noisy data is generally a bad idea.

### Classical Regularization Methods for Deep Networks

Weight decay is a common regularization method which penalises the size of the weights. It is called ridge regularization if an l2 penalty is used, and lasso regularization if an l1 penalty is used (l2 shrinks all parameters but keeps them, whereas l1 drives some parameters exactly to zero, performing automatic parameter ablation). However, these penalties alone may not be enough without also including early stopping.
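
The different behaviour of the two penalties can be seen on a single weight. The sketch below is an illustration under stated assumptions: the function names, learning rate and penalty strength are made up, and the l1 update uses a proximal (soft-thresholding) step, which is a swapped-in technique chosen because it yields exact zeros, where a plain subgradient step would not.

```python
def l2_step(w, grad, lr, lam):
    """One gradient step on loss + (lam / 2) * w**2.

    The penalty contributes lam * w to the gradient, so the weight is
    shrunk in proportion to its current size.
    """
    return w - lr * (grad + lam * w)

def l1_step(w, grad, lr, lam):
    """One proximal step on loss + lam * |w| (soft-thresholding).

    The weight moves towards zero by a fixed amount lr * lam and is
    clipped to exactly zero once it falls inside that threshold.
    """
    w = w - lr * grad
    threshold = lr * lam
    if w > threshold:
        return w - threshold
    if w < -threshold:
        return w + threshold
    return 0.0

# With a zero data gradient, only the penalty acts on the weight.
w_l2 = w_l1 = 0.05
for _ in range(100):
    w_l2 = l2_step(w_l2, grad=0.0, lr=0.1, lam=0.1)
    w_l1 = l1_step(w_l1, grad=0.0, lr=0.1, lam=0.1)
print(w_l2, w_l1)
```

After 100 steps the l2-penalised weight has decayed but is still non-zero, while the l1-penalised weight has been set exactly to zero, which is the "keeps all parameters" versus "parameter ablation" distinction.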

### Summary

In contrast to classical models, deep learning models tend to be overparameterized. This regime challenges many firmly held intuitions about how models _should_ behave if they are to generalize. Paradoxically, the methods that improve models by reducing overfitting (the generalization gap) include both those that increase _and_ those that decrease model complexity. Why certain choices improve generalization is a question which remains open despite being attacked by many talented researchers.

