# Generalization in Classification

As always, when training a model we are interested in learning patterns that generalize. It is not enough to fit the training data, achieve high accuracy, and call the job "done".

This requirement raises a number of burning questions:
1. How many test examples do we need to make sure that the test set is "representative" of the underlying population?
2. What happens if we keep evaluating models on the same test set repeatedly?
3. Why should we expect that fitting our linear model to the training set should necessarily fare any better than naive memorization?

Although we've already covered some aspects of this, this section covers them in more depth and introduces the basics of statistical learning theory. It turns out that we can often guarantee generalization _a priori_, though the results are a bit unsatisfying: the guarantees often call for _trillions_ of training examples. Thankfully, in practice most deep learning models generalize well with far fewer examples.

## The Test Set

The empirical error of a classifier $f$ is evaluated on a fixed, held-out dataset $D$ that played no part in training. Because $D$ is only a sample, the empirical error is an estimate of the true _population error_, which is the quantity we actually care about. Estimating the population error is an exercise in mean estimation: if the examples in $D$ are random draws from the underlying distribution, the empirical error is an average over those draws, so we can apply the central limit theorem.
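
To make the two quantities concrete, one common way to write them is shown below. The symbols $\epsilon_D$, $\epsilon$, $P$, and the indicator $\mathbf{1}$ are notation assumed here for illustration, with $n$ the number of examples $(x^{(i)}, y^{(i)})$ in $D$ and $P$ the underlying distribution:

$$\epsilon_D(f) = \frac{1}{n}\sum_{i=1}^{n} \mathbf{1}\left(f(x^{(i)}) \neq y^{(i)}\right), \qquad \epsilon(f) = \mathbb{E}_{(x, y) \sim P}\left[\mathbf{1}\left(f(x) \neq y\right)\right]$$

Written this way, $\epsilon_D(f)$ is simply the sample mean of a 0/1 random variable whose expectation is $\epsilon(f)$, which is why mean-estimation tools apply.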

The central limit theorem tells us that our estimate of the error approaches the true error at a rate of $\mathcal{O}(1/\sqrt{n})$, where $n$ is the number of examples in our test dataset. This matters because it shows that to estimate the test error twice as precisely, we must collect a test set that is four times as large. From a statistical perspective, this is about the _best_ we could hope for.

Theoretically speaking, we often find that about 10,000-15,000 examples are needed in the test set to get a good estimate of the population error (for instance, to within about $\pm 0.01$ at roughly 95% confidence). Funnily enough, the test sets of many popular benchmarks contain about this many examples.
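
The link between test set size and precision can be sketched in a few lines of Python. The function names `standard_error` and `required_test_size` are hypothetical, and the sketch assumes a normal approximation, a worst-case population error of 0.5, and an interval of two standard deviations (about 95% confidence):

```python
import math

def standard_error(n, p=0.5):
    """Standard deviation of the empirical error over n test examples.

    p is the (unknown) population error; p=0.5 is the worst case because
    it maximizes the Bernoulli variance p * (1 - p).
    """
    return math.sqrt(p * (1 - p) / n)

def required_test_size(margin, z=2.0, p=0.5):
    """Smallest n for which z standard errors fit inside +/- margin."""
    return math.ceil((z * math.sqrt(p * (1 - p)) / margin) ** 2)

n = required_test_size(0.01)
print(n, standard_error(n))

# Halving the margin roughly quadruples the required test set size.
print(required_test_size(0.005) / n)
```

Under these assumptions, a margin of 0.01 calls for roughly 10,000 examples, which matches the lower end of the range quoted above.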

## Test Set Reuse

In some sense, we are now equipped to begin empirical ML research. Let's say we do everything properly and develop a model $f_1$ using only a training and validation set, including the choice of model architecture, hyperparameters, and so on. Great. The catch is that the next model we train no longer formally has a "test set", because we already used it to evaluate $f_1$. As we add more classifiers, it becomes hard to say honestly whether a "better" model is actually better, or whether it merely scored better by random chance on that particular test set.

In practice, this may not be such a problem, but it is a reminder to be vigilant and disciplined when developing multiple models, especially when the stakes are high. Consult test sets as infrequently as possible, and consider keeping multiple test sets if you can. The phenomenon in which models drift toward matching the test set, even though it is never used directly in training, is called _adaptive overfitting_.
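
The risk of picking a winner by chance can be illustrated with a small simulation. This is a minimal sketch under strong simplifying assumptions: every model has the same true accuracy, their errors are independent, and the names `observed_accuracy` and `best_of_k` are hypothetical. Selection here is non-adaptive, so it only shows the "lucky winner" effect, not the full feedback loop of adaptive overfitting.

```python
import random

def observed_accuracy(true_acc, n_test, rng):
    """Accuracy measured on n_test examples for a model whose
    true (population) accuracy is true_acc."""
    correct = sum(rng.random() < true_acc for _ in range(n_test))
    return correct / n_test

def best_of_k(true_acc=0.9, n_test=1000, k=50, seed=0):
    """Score k equally good models on one test set; return (best, mean)."""
    rng = random.Random(seed)
    scores = [observed_accuracy(true_acc, n_test, rng) for _ in range(k)]
    return max(scores), sum(scores) / k

best, mean = best_of_k()
print(f"mean observed accuracy: {mean:.3f}")
print(f"best observed accuracy: {best:.3f}")
```

Although no model is truly better than the others, the best observed score sits above the shared true accuracy, so reporting only the winner overstates performance.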

## Statistical Learning Theory

Statistical learning theory is the mathematical subfield of machine learning that seeks fundamental answers to how, when, and why machine learning algorithms generalize.

## Summary

Test sets are the bedrock of evaluation. For a true test set, one that has not influenced model development, we can theoretically bound how close the error estimate is to the population error. In practice, test sets are rarely true test sets, because researchers use them again and again, and once a test set has been used to evaluate multiple models, controlling for false discovery can be a challenge.