# Generalisation

Generalisation is the key aim and challenge of machine learning. We do not want to produce a model that is only capable of perfectly reproducing the results of a training set, but one which is capable of predicting with some degree of accuracy the properties of unseen data. 

Even when datasets are extremely large, we will almost never have a dataset which covers the space of all possible observations. The phenomenon of fitting closer to the training data than to the underlying distribution is known as overfitting. Techniques for dealing with overfitting are often called "regularisation", and are introduced briefly here.

## Training error and generalisation error

Our initial assumption (known as the IID assumption) is that our training data and validation data are independent and identically distributed, i.e. every example is drawn independently from the same underlying distribution. This is a broadly reasonable assumption, provided the data were selected with good sampling methods.

It is important to distinguish between the training error $R_{emp}$, which is a statistic computed on a finite sample of real examples, and the generalisation error $R$. The generalisation error can be broadly thought of as the expected error when the model is applied to an infinite number of previously unseen examples drawn from the same underlying distribution.
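One common way to write the two quantities is sketched below. The loss $\ell$, the model $f$, the sample size $n$ and the data distribution $P$ are notation assumed here for illustration; they do not appear elsewhere in these notes.

$$
R_{emp}(f) = \frac{1}{n} \sum_{i=1}^{n} \ell\big(f(x_i), y_i\big),
\qquad
R(f) = \mathbb{E}_{(x, y) \sim P}\big[\ell\big(f(x), y\big)\big]
$$

The first is an average over the $n$ training examples we actually have; the second is an expectation over the whole distribution, which is why it cannot be evaluated directly.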

Since we do not know what this underlying distribution actually is, we cannot usually compute the generalisation error exactly. Instead, we estimate it by computing the error on a subset of the available data which is withheld from training (a validation set).
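As a minimal sketch of this held-out estimate, the example below fits a straight line to synthetic data and compares the error on the training portion with the error on a withheld portion. The data-generating function, the 80/20 split and the use of mean squared error are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (an assumption for illustration): y = 2x + 1 plus Gaussian noise.
n = 200
x = rng.uniform(-1.0, 1.0, size=n)
y = 2.0 * x + 1.0 + rng.normal(scale=0.1, size=n)

# Withhold 20% of the examples; they are never used to fit the model.
perm = rng.permutation(n)
n_val = n // 5
val_idx, train_idx = perm[:n_val], perm[n_val:]

# Fit a straight line by least squares on the training portion only.
slope, intercept = np.polyfit(x[train_idx], y[train_idx], deg=1)


def mse(idx):
    """Mean squared error of the fitted line on the examples in idx."""
    pred = slope * x[idx] + intercept
    return float(np.mean((pred - y[idx]) ** 2))


train_error = mse(train_idx)  # empirical (training) error
val_error = mse(val_idx)      # held-out estimate of the generalisation error
print(f"train MSE = {train_error:.4f}, validation MSE = {val_error:.4f}")
```

For a simple model with plenty of data, as here, the two numbers should come out close to each other.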

### Model Complexity

Often, for simple models with abundant training data (e.g. the linear example models already looked at), the training and generalisation errors tend to be quite close, but this is rarely the case for more complicated models. In general, we cannot say that a model is performing better simply because it fits the training data better. It is entirely possible to give an ML model an arbitrarily large number of parameters, allowing it to fit the training data with arbitrary accuracy, and yet end up with a model that does not generalise at all.
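The effect of adding parameters can be sketched with polynomial regression, where the degree controls the number of parameters. The sine target, the noise level and the chosen degrees below are assumptions for illustration only. With 10 training points, a degree-9 polynomial has 10 coefficients and can pass through every training point.

```python
import numpy as np

rng = np.random.default_rng(0)


def target(x):
    """Hypothetical underlying function, chosen only for this example."""
    return np.sin(np.pi * x)


# Small noisy training set and a larger validation set from the same distribution.
x_train = np.linspace(-1.0, 1.0, 10)
y_train = target(x_train) + rng.normal(scale=0.2, size=x_train.size)
x_val = rng.uniform(-1.0, 1.0, size=200)
y_val = target(x_val) + rng.normal(scale=0.2, size=x_val.size)


def fit_and_score(degree):
    """Least-squares polynomial fit; returns (training MSE, validation MSE)."""
    model = np.polynomial.Polynomial.fit(x_train, y_train, deg=degree)
    train_mse = float(np.mean((model(x_train) - y_train) ** 2))
    val_mse = float(np.mean((model(x_val) - y_val) ** 2))
    return train_mse, val_mse


results = {degree: fit_and_score(degree) for degree in (1, 3, 9)}
for degree, (train_mse, val_mse) in results.items():
    print(f"degree {degree}: train MSE = {train_mse:.2e}, validation MSE = {val_mse:.2e}")
```

The training error can only go down as the degree grows, and at degree 9 it is essentially zero because the model interpolates the noise. The validation error does not follow the same trend, which is the point of the paragraph above.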

### Underfitting or overfitting

Two common situations to be aware of when comparing the training and validation errors:

1. Training and validation errors are both high, and there is little gap between them. If the model is unable to reduce the training error, this is an indication that the model is too small or too simple to fit our data. Because the generalisation gap ($R - R_{emp}$) is small, we surmise that we could use a model with more parameters. This situation is known as _underfitting_.

2. Training error is small, but validation error is very high. This is an indication of _overfitting_. It is worth noting that even some of the best machine learning models perform substantially better on their training data than on unseen data, so a gap on its own is not necessarily a problem.
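The two situations above can be expressed as a rough rule of thumb. The function below is a hypothetical helper, not a standard API: what counts as a "high" error or a "large" gap depends entirely on the problem, so both thresholds are arguments the caller must supply.

```python
def diagnose(train_error, val_error, acceptable_error, gap_tolerance):
    """Rough diagnosis from training and validation errors.

    acceptable_error and gap_tolerance are hypothetical, problem-specific
    thresholds; nothing in this sketch tells you how to choose them.
    """
    gap = val_error - train_error
    if train_error > acceptable_error and gap <= gap_tolerance:
        # Both errors are high and close together.
        return "underfitting"
    if train_error <= acceptable_error and gap > gap_tolerance:
        # Training error is low, validation error is much higher.
        return "overfitting"
    # Either the fit looks fine, or the picture is mixed.
    return "neither"


print(diagnose(train_error=0.90, val_error=0.95, acceptable_error=0.1, gap_tolerance=0.1))
print(diagnose(train_error=0.01, val_error=0.80, acceptable_error=0.1, gap_tolerance=0.1))
```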

The fewer data points we have in our training set, the more likely we are to encounter overfitting. Moreover, in general, for deep learning and neural networks, more data rarely hurts.
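The dependence on training set size can be sketched by keeping the model fixed and varying only the number of training examples. The sine target, the degree-5 polynomial and the sizes 8, 32 and 512 are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)


def sample(n):
    """Draw n noisy examples from a hypothetical sine-shaped distribution."""
    x = rng.uniform(-1.0, 1.0, size=n)
    y = np.sin(np.pi * x) + rng.normal(scale=0.2, size=n)
    return x, y


# One fixed validation set shared by all experiments.
x_val, y_val = sample(1000)


def generalisation_gap(n_train, degree=5):
    """Validation MSE minus training MSE for a polynomial fit on n_train points."""
    x, y = sample(n_train)
    model = np.polynomial.Polynomial.fit(x, y, deg=degree)
    train_mse = float(np.mean((model(x) - y) ** 2))
    val_mse = float(np.mean((model(x_val) - y_val) ** 2))
    return val_mse - train_mse


gaps = {n: generalisation_gap(n) for n in (8, 32, 512)}
for n, gap in gaps.items():
    print(f"n_train = {n:4d}: validation MSE - training MSE = {gap:.4f}")
```

With the largest training set the gap should be close to zero; with only a handful of points the same model typically shows a much larger gap, although the exact numbers depend on the random draw.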

### Model selection

Settle on a final model only after comparing many different model architectures, training objectives, selected features and so on. In principle, we should never touch the test data during this process, but in practice this ideal is often compromised. One approach is to split the data into training, validation and test sets, where the validation set is used to detect possible overfitting on the training set. If data is really scarce, K-fold cross-validation can be used: the dataset is split into K subsets, and each model is trained K times, each time on K-1 of the subsets and validated on the remaining one, with the K validation errors then averaged.
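A minimal sketch of K-fold cross-validation used for model selection is given below. The helper names `k_fold_indices` and `cross_val_mse`, the synthetic data and the candidate polynomial degrees are all assumptions for illustration; libraries such as scikit-learn provide ready-made equivalents.

```python
import numpy as np


def k_fold_indices(n_samples, k, rng):
    """Yield (train_idx, val_idx) pairs; each fold is the validation set once."""
    folds = np.array_split(rng.permutation(n_samples), k)
    for i in range(k):
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, folds[i]


def cross_val_mse(x, y, degree, k=5, seed=0):
    """Average validation MSE of a polynomial fit over k folds."""
    rng = np.random.default_rng(seed)
    scores = []
    for train_idx, val_idx in k_fold_indices(len(x), k, rng):
        # Train on K-1 folds, validate on the one that was left out.
        model = np.polynomial.Polynomial.fit(x[train_idx], y[train_idx], deg=degree)
        scores.append(np.mean((model(x[val_idx]) - y[val_idx]) ** 2))
    return float(np.mean(scores))


# Hypothetical scarce dataset: 40 noisy samples of a sine curve.
data_rng = np.random.default_rng(1)
x = data_rng.uniform(-1.0, 1.0, size=40)
y = np.sin(np.pi * x) + data_rng.normal(scale=0.2, size=x.size)

# Compare candidate models by cross-validated error, never by test error.
cv_scores = {degree: cross_val_mse(x, y, degree) for degree in (1, 3, 9)}
best_degree = min(cv_scores, key=cv_scores.get)
print(cv_scores, "-> selected degree:", best_degree)
```

Using the same `seed` for every candidate means all models are compared on identical folds, which keeps the comparison fair. Any separate test set is left out of this loop and consulted only once a final model has been chosen.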

