%run ../../common/import_all.py

from common.setup_notebook import set_css_style, setup_matplotlib, config_ipython

config_ipython()
setup_matplotlib()
set_css_style()

# Evaluating a model

The performance of the model you've built should always be evaluated on a separate set of data which has not been seen (in any way) during the training phase. This ensures that the evaluation is unbiased and clean. 

## Concepts: how to in a nutshell

### Training and test set

When you have a single candidate model and, at the end of your modelling efforts, you are tasked with evaluating how good it is, you need a separate dataset to run it on, usually called the *test set*. The usual way to get one is to set aside a portion of the whole dataset at the very start of the modelling work and train the model on the rest, which is called the *training set*. Common splits are 70% training and 30% test, or 80/20; the right choice depends on how much data you have in the first place, as you don't want to remove too much from the training set.
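A minimal sketch of such a split in plain Python, working on sample indices. The function name `train_test_split_indices`, the fixed seed and the 70/30 default are illustrative assumptions, not part of the original text; libraries such as scikit-learn offer an equivalent utility (`train_test_split`).

```python
import random

def train_test_split_indices(n_samples, test_fraction=0.3, seed=0):
    """Return (train_idx, test_idx), a random partition of range(n_samples)."""
    indices = list(range(n_samples))
    # Shuffle first so the split does not depend on the original ordering
    random.Random(seed).shuffle(indices)
    n_test = round(n_samples * test_fraction)
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = train_test_split_indices(10, test_fraction=0.3)
print("training points:", len(train_idx), "test points:", len(test_idx))
```

The test indices are put aside once, at the start, and are not touched again until the final evaluation.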

Note that this is really essential! Performance metrics evaluated on the training set will typically look better than they should: the model has been trained on those data, so it has learned their patterns. It's only when you use new data that you see the actual performance, that is, how good your model is at generalising.
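The gap between training and test performance can be made visible with a toy example. The sketch below uses a 1-nearest-neighbour classifier on invented one-dimensional data (both the data and the helper names are assumptions made for illustration). On the training set each point is its own nearest neighbour, so the accuracy there is perfect by construction, while the noisy labels hurt the predictions on new points.

```python
def predict_1nn(train, x):
    """Return the label of the training point closest to x."""
    nearest = min(train, key=lambda point: abs(point[0] - x))
    return nearest[1]

def accuracy(train, data):
    """Fraction of points in data whose label is predicted correctly."""
    correct = sum(predict_1nn(train, x) == label for x, label in data)
    return correct / len(data)

# Invented data: (feature, label); the points at 3.0 and 4.0 act as label noise
train = [(1.0, "a"), (2.0, "a"), (3.0, "b"), (4.0, "a"), (5.0, "b"), (6.0, "b")]
test = [(1.5, "a"), (2.9, "a"), (4.1, "b"), (5.5, "b")]

train_accuracy = accuracy(train, train)  # evaluated on data seen in training
test_accuracy = accuracy(train, test)    # evaluated on unseen data
print("train:", train_accuracy, "test:", test_accuracy)
```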

Also note that, in the spirit of keeping the test set out of the modelling phase entirely to avoid polluting the subsequent evaluation, any rescaling you applied to the training data must be applied to the test set using the statistics of the training set, not those of the test set. So if you have subtracted the mean and divided by the standard deviation in the training set, the right thing to do is to use the mean and standard deviation of the training set on the test set as well. Otherwise you'd be using information that comes from the test set itself.
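As a sketch of this rule, the snippet below standardises a toy feature using only the mean and standard deviation computed on the training values. The numbers and the `standardise` helper are invented for illustration, and the population standard deviation is an arbitrary choice here.

```python
import statistics

train_values = [2.0, 4.0, 6.0, 8.0]
test_values = [5.0, 10.0]

# Statistics are computed on the training set only
train_mean = statistics.mean(train_values)
train_std = statistics.pstdev(train_values)

def standardise(values, mean, std):
    """Subtract the given mean and divide by the given standard deviation."""
    return [(v - mean) / std for v in values]

train_scaled = standardise(train_values, train_mean, train_std)
# The test set is rescaled with the training statistics, not its own
test_scaled = standardise(test_values, train_mean, train_std)
print(test_scaled)
```

Note that the rescaled test values will in general not have zero mean and unit standard deviation, and that is expected.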

### Training, test and validation sets

When you are evaluating which of several models performs best in your case, the way to go is to divide the original dataset into three parts:

* a *training* set to fit the models
* a *test* set to evaluate the final model
* a *validation* set to select the model

While a training/test split is fine when you are evaluating one model only, you need a three-way split when you have several rival models to choose from. A typical choice is 50%/25%/25%, but it varies depending on the size of the dataset you start with. Rival models can be different algorithms, or the same algorithm with different parameters, so this procedure is also the one to use when you need to tune a model.
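A three-way split can be sketched in the same spirit as a two-way one. The function name `three_way_split_indices`, the seed and the default fractions below are illustrative assumptions.

```python
import random

def three_way_split_indices(n_samples, fractions=(0.5, 0.25), seed=0):
    """Return (train_idx, validation_idx, test_idx) partitioning range(n_samples).

    fractions holds the training and validation shares; the test set
    receives whatever is left.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_train = round(n_samples * fractions[0])
    n_validation = round(n_samples * fractions[1])
    train_idx = indices[:n_train]
    validation_idx = indices[n_train:n_train + n_validation]
    test_idx = indices[n_train + n_validation:]
    return train_idx, validation_idx, test_idx

train_idx, validation_idx, test_idx = three_way_split_indices(20)
print(len(train_idx), len(validation_idx), len(test_idx))
```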

The test set is to be kept aside and used only at the very end of the analysis, to estimate the generalisation error. The validation set is used to assess the performance of each candidate model you are evaluating. The model you run on the test set at the end is the one with the best performance in the validation phase.

If you used only a two-way split in these cases, you'd risk underestimating the error and overfitting: the model would have been chosen by tuning it on the same set used to evaluate it, so its score on that set is optimistic and it may not generalise well to unseen data.

The training set is used to train each model (or each combination of parameters) on the data; the validation phase then assesses the performance of each of these models/tunings, after which you select the one giving the best result. This phase can, for instance, encompass a cross-validation.
The testing phase is then needed to get an unbiased estimate of the generalisation error, as the model will run on fresh, independent data.

Once the whole selection procedure is finished, what you typically do to obtain a usable model is re-train the chosen one on the training and validation sets together and then test it on the test set.
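The whole procedure can be sketched end to end on synthetic data. Everything below is an assumption made for illustration: the rival models are $k$-nearest-neighbours classifiers with different numbers of neighbours, the data are generated with 10% label noise, and the helper names are invented.

```python
import random
from collections import Counter

def knn_predict(train, x, k):
    """Majority label among the k training points closest to x."""
    neighbours = sorted(train, key=lambda point: abs(point[0] - x))[:k]
    return Counter(label for _, label in neighbours).most_common(1)[0][0]

def accuracy(train, data, k):
    """Fraction of points in data classified correctly by k-NN fitted on train."""
    return sum(knn_predict(train, x, k) == y for x, y in data) / len(data)

# Synthetic data: label is 1 when x > 0.5, flipped with probability 0.1
rng = random.Random(0)
data = []
for _ in range(200):
    x = rng.random()
    label = int(x > 0.5)
    if rng.random() < 0.1:
        label = 1 - label
    data.append((x, label))

# 50% / 25% / 25% split (the points are already in random order)
train, validation, test = data[:100], data[100:150], data[150:]

# Validation phase: score every rival model and keep the best one
candidate_ks = [1, 5, 15]
validation_scores = {k: accuracy(train, validation, k) for k in candidate_ks}
best_k = max(validation_scores, key=validation_scores.get)

# Re-train on training + validation, then evaluate once on the test set
final_accuracy = accuracy(train + validation, test, best_k)
print("selected k:", best_k, "test accuracy:", final_accuracy)
```

The test set enters only in the last step, so `final_accuracy` is not affected by the choice among the candidates.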

## Techniques: Cross Validation

### What do you do in cross-validation?

Cross validation is a technique for validating the performance of a model. In its basic form it consists in dividing the original data sample into sets, using one part as the training set and validating the performance on the other, the test set, then repeating the procedure multiple times with different splits of the original set. Eventually, the results are averaged.

The procedure gives a more reliable estimate than a single training/test split, because averaging over several splits makes the result less dependent on which points happened to end up in the test set.

There are multiple categories of cross-validation techniques. 

### Types of Cross Validation techniques

#### $k$-fold cross validation

In this case, the original data set is split into $k$ equally-sized sub-samples and each sub-sample is in turn used as the test set while the remaining $k-1$ constitute the training set. The procedure is thus repeated $k$ times (called *folds*), one for each test set. At the end, an average of all validation results is computed. This way, all sample points in the original set are used both for training and for testing (in different folds).
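A minimal sketch of the $k$-fold mechanics in plain Python. The `k_fold_indices` helper, the toy targets and the "model" (predicting the mean of the training targets, scored with the mean absolute error) are all assumptions chosen to keep the example short; shuffling before splitting is omitted.

```python
from statistics import mean

def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) for each of the k folds."""
    indices = list(range(n_samples))
    # Fold sizes differ by at most one when n_samples is not a multiple of k
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size

targets = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
folds = list(k_fold_indices(len(targets), k=3))

fold_errors = []
for train_idx, test_idx in folds:
    # "Training": the prediction is the mean of the training targets
    prediction = mean(targets[i] for i in train_idx)
    # "Validation": mean absolute error on the held-out fold
    fold_errors.append(mean(abs(targets[i] - prediction) for i in test_idx))

cv_error = mean(fold_errors)  # average of the k validation results
print("per-fold errors:", fold_errors, "cross-validated error:", cv_error)
```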

If folds are selected such that each set contains approximately the same percentage of samples of each target class (or approximately the same mean of the dependent variable in the case of regression), it is called a *stratified k-fold cross validation*.
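For classification, one simple way to obtain stratified folds is to deal the indices of each class across the folds in turn. The sketch below takes that approach; it is an illustrative assumption and not necessarily how any given library implements stratification, and the function name and labels are invented.

```python
from collections import Counter, defaultdict

def stratified_test_folds(labels, k):
    """Return k lists of test indices with class proportions kept roughly equal."""
    indices_by_class = defaultdict(list)
    for index, label in enumerate(labels):
        indices_by_class[label].append(index)
    folds = [[] for _ in range(k)]
    for class_indices in indices_by_class.values():
        # Deal the indices of each class round-robin across the folds
        for position, index in enumerate(class_indices):
            folds[position % k].append(index)
    return folds

labels = ["a"] * 8 + ["b"] * 4  # two thirds "a", one third "b"
test_folds = stratified_test_folds(labels, k=4)
for fold in test_folds:
    print(fold, Counter(labels[i] for i in fold))
```

For each fold, the training set is made of all the indices not in that fold.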

#### Leave-one-out

This is the variant of the $k$-fold with $k=n$, $n$ being the number of samples: each test set consists of a single data point, so it's one against everyone else.
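In terms of indices, the leave-one-out splits can be sketched as follows (the function name is an illustrative assumption):

```python
def leave_one_out_indices(n_samples):
    """Yield (train_idx, test_idx) with a single held-out point per split."""
    for held_out in range(n_samples):
        train_idx = [i for i in range(n_samples) if i != held_out]
        yield train_idx, [held_out]

splits = list(leave_one_out_indices(4))
for train_idx, test_idx in splits:
    print("train:", train_idx, "test:", test_idx)
```

There are as many splits as there are samples, so the model has to be fitted $n$ times.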

#### Leave-$p$-out

It is the same as the leave-one-out, except that the test set is made up of $p$ samples and all possible splits are considered, meaning that every possible way of selecting $p$ items as the test set is built. It is the comprehensive way to consider all possible splits and a very expensive procedure: for $n$ samples there are $\binom{n}{p}$ of them. Note that a $k$-fold is an approximation of this one, as not all splits are considered (because the original set is preliminarily partitioned into $k$ subsets).
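A sketch of the enumeration using `itertools.combinations` from the standard library; the function name and the sample sizes are illustrative assumptions. The number of splits equals the binomial coefficient, which is what makes the procedure expensive.

```python
from itertools import combinations
from math import comb

def leave_p_out_indices(n_samples, p):
    """Yield (train_idx, test_idx) for every possible test set of size p."""
    all_indices = range(n_samples)
    for test_idx in combinations(all_indices, p):
        held_out = set(test_idx)
        train_idx = [i for i in all_indices if i not in held_out]
        yield train_idx, list(test_idx)

splits = list(leave_p_out_indices(5, p=2))
print("splits for n=5, p=2:", len(splits))

# The count grows very quickly with the number of samples
print("splits for n=100, p=5:", comb(100, 5))
```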

#### Hold-out

It can be seen as a $k$-fold with $k=2$ in which points are randomly assigned to the two sets, typically with the training set being the bigger one, and the evaluation is run only once. It is a very loose way to do a cross validation, its only real advantages being speed and the fact that both sets are large.


#### Repeated random sub-sampling validation

Also called Monte Carlo cross validation, this method randomly splits the original data set into training and test sets at each iteration. The advantage of this method over a $k$-fold is that the number of points in the training and test parts does not depend on the number of iterations (folds) chosen; the disadvantage is that some samples may never be selected for the test set, while others may be selected multiple times over the iterations.
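The sketch below draws a few independent random splits and counts how often each sample lands in the test set. The function name, the seed, the 25% test fraction and the number of iterations are illustrative assumptions. With 3 iterations of 5 test points each out of 20 samples, at least 5 samples are necessarily never tested.

```python
import random
from collections import Counter

def monte_carlo_splits(n_samples, n_iterations, test_fraction=0.25, seed=0):
    """Yield (train_idx, test_idx) from an independent random split per iteration."""
    rng = random.Random(seed)
    n_test = round(n_samples * test_fraction)
    for _ in range(n_iterations):
        shuffled = rng.sample(range(n_samples), n_samples)
        yield shuffled[n_test:], shuffled[:n_test]

n_samples = 20
test_counts = Counter()
for train_idx, test_idx in monte_carlo_splits(n_samples, n_iterations=3):
    test_counts.update(test_idx)

never_tested = [i for i in range(n_samples) if test_counts[i] == 0]
tested_more_than_once = [i for i, count in test_counts.items() if count > 1]
print("never in a test set:", never_tested)
print("in a test set more than once:", tested_more_than_once)
```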