# Model Selection / Model Evaluation

An appropriate use of model evaluation, model selection, and algorithm selection
techniques is essential to developing reliable machine learning models. This lecture gives an overview of different techniques that can be used for
each of these three subtasks and discusses the main advantages and disadvantages
of each technique with theoretical and applied examples.

## Introduction

A common goal of all machine learning applications is making "good" predictions. Fitting a model to training data is one thing; the real question is how to make sure that it will perform well on unknown future observations. We also want to know what makes a "good" model, and how to choose the learning algorithm best suited to the task at hand.

## 1. Model Evaluation - Essentials

Model evaluation is not just the end point of the machine learning pipeline. Data scientists need to anticipate which evaluation techniques fit a given problem before handling any data, and this choice may also influence the way the data needs to be prepared.
In this section, we will go over some evaluation techniques and how they fit into the machine learning workflow as a whole.

### 1.1. Generalization performance

How do we measure the performance of a machine learning model? A common solution is to fit the model on a training sample, then compute predictions on a test sample and compare them with the actual values of the target variable on this test sample.
However obvious this solution may seem, it is not necessarily that easy. In a perfect world, the estimated performance of a model would tell you exactly how it would perform on unseen data. This is one of the typical goals of machine learning and of statistics in general: to understand or infer how things work in an unobservably vast reality from observing only a portion of it.
Machine learning often involves experimentation, whether by training different types of models on the data or by searching for the best set of hyperparameters for a specific model. We then wish to select the best-performing model, and therefore we need a way to evaluate and rank the candidates. Depending on the problem, other criteria may come into play on top of predictive performance; computational performance, for example, may be critical to certain tasks (such as running predictive models on very light hardware). The main things we wish to do are the following:

1. Estimate the generalization performance, i.e. the predictive performance of our
model on future (unseen) data.
2. Increase the predictive performance by tweaking the hyperparameters and
selecting the best performing model from a given hypothesis space.
3. Identify the machine learning algorithm that is best-suited for the problem at
hand (e.g. Random Forest vs SVM).

These three tasks all require evaluating the model, but they call for different approaches. We wish to estimate the future performance of our predictive model as accurately as possible, which is difficult because it depends on how representative our data is of the reality of the problem. In practice it is very hard to collect a perfectly representative sample (sampling is a science in itself, used for polls among other things), so data is often biased. However, if all our models are affected by the same bias, a relative measure of their performance still makes it possible to rank them. For example, suppose all performance estimates are pessimistically biased and we underestimate every accuracy by 10 percentage points. If the unbiased performances of three models are respectively:

M2: 75% > M1: 70% > M3: 65%,

we would still rank them the same way after applying a pessimistic bias of 10 percentage points:

M2: 65% > M1: 60% > M3: 55%.

However, we cannot confidently state that the generalization (future prediction) accuracy of the best-ranked model (M2) is 65%, since we haven't measured the bias. Estimating the absolute performance
of a model is probably one of the most challenging tasks in machine learning.
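
The ranking argument above can be checked with a few lines of Python. This is only an illustrative sketch: the accuracies are the ones quoted in the text, and the helper `ranking` is a hypothetical name introduced here.

```python
# Accuracy values taken from the example in the text.
unbiased_acc = {"M1": 0.70, "M2": 0.75, "M3": 0.65}

# Apply the same pessimistic bias (10 percentage points) to every estimate.
bias = 0.10
biased_acc = {name: acc - bias for name, acc in unbiased_acc.items()}

def ranking(scores):
    """Return the model names sorted from best to worst score."""
    return sorted(scores, key=scores.get, reverse=True)

print(ranking(unbiased_acc))  # ['M2', 'M1', 'M3']
print(ranking(biased_acc))    # ['M2', 'M1', 'M3']
```

A constant offset leaves the order untouched, but nothing in the biased numbers tells us how large the offset is.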

### 1.2. Assumptions and Terminology

* **i.i.d.** The first assumption we make here is that the data consists of independent and identically distributed observations, **i.i.d.** for short, which was the case for all the problems we have studied before. Time series are an example of non-i.i.d. data, because an observation measured at a given date depends on observations measured at earlier dates.

* **Supervised learning** We will start by covering supervised learning, the subcategory of machine learning where a target variable is identified and measured.

* **Classification** This lecture will mainly focus on classification, although several concepts also apply to regression problems, and the lecture will mention it when that is the case.

* **Accuracy** and **0-1 Loss** The evaluation metric we will focus on for this part of the lecture is the accuracy: the number of correct predictions divided by the number of observations. We define the 0-1 Loss as the following function:
$$L(\hat{y_i},y_i) = \begin{cases} 0 & \text{if } \hat{y_i} = y_i \\ 1 & \text{if } \hat{y_i} \neq y_i \end{cases}$$
where $\hat{y_i}$ is the model's prediction for observation $i$ and $y_i$ is the target value for observation $i$.
With this definition we can now calculate the prediction error over a set of $n$ observations noted $S$:

$$
ERR_S = \frac{1}{n}\sum_{i=1}^{n}L(\hat{y_i},y_i)
$$

And therefore the accuracy can be formally written as:

$$
ACC_S = 1 - ERR_S
$$

* **R-Square** For regression models, the most commonly used evaluation metric is **R-square**, which is equal to:
$$R^2_S = 1 - \frac{SSR}{SST}$$
where $SSR$ is the sum of squared residuals:

$$SSR = \sum_{i=1}^{n}(\hat{y_i}-y_i)^2$$

and $SST$ is the total sum of squares:

$$SST = \sum_{i=1}^{n}(y_i-\bar{y})^2$$

$\bar{y}$ being the mean of $y$, in practice the empirical mean of the target values over $S$.


* **Bias** In what follows, the term **Bias** refers to the statistical bias. For an estimator $\hat{\beta}$, the bias is defined as:

$$Bias = E[\hat{\beta}] - \beta$$

where $E[\hat{\beta}]$ is the expected value of $\hat{\beta}$ (in practice approximated by an empirical mean) and $\beta$ is the true value of the parameter being estimated. Concretely, in this lecture the bias refers to the difference between the expected value of an accuracy estimate and the model's true prediction accuracy. For example, the training accuracy (or alternatively the training R-square) of a model is usually an optimistically biased estimate of its true accuracy, since the model is very likely to perform better on the observations it was trained on than on unseen observations.

* **Variance** The variance is the statistical variance of the estimator, which can be written:
$$Variance = E[(E[\hat{\beta}]-\hat{\beta})^2]$$
A high-variance model is one whose predictions change substantially in response to small fluctuations in the data it is trained on.

* **Target function** The target function, noted $f$, is the true function that links our input data to the target variable, i.e. $f(x)=y$. The purpose of supervised machine learning is to approximate $f$.

* **Model** In machine learning, the model is a function, often noted $h$ in the scientific literature (because in ML, model and hypothesis are synonymous), that we hope resembles the unknown target function $f$.

* **Learning Algorithm** The learning algorithm is the process through which we attempt to make the model fit the target function by exploring the hypothesis space defined by the model choice. For example, if the hypothesis space (the model choice) is linear regression, then the learning algorithm determines the coefficient associated with each explanatory variable.

* **Hyperparameters** The hyperparameters are characteristics of the model that affect the way the learning algorithm trains the model on the data. For example, for a ```DecisionTreeClassifier```, ```max_depth``` is a hyperparameter that sets the maximum number of successive splits allowed when building the decision tree.
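
To make the metrics defined in this list concrete, here is a minimal NumPy sketch. It takes the 0-1 loss to be 1 for a wrong prediction and 0 for a correct one, so that $ACC_S = 1 - ERR_S$ holds. The function names and the toy arrays are illustrative assumptions, not part of any library.

```python
import numpy as np

def zero_one_loss(y_pred, y_true):
    """Per-observation 0-1 loss: 1 for a wrong prediction, 0 for a correct one."""
    return (np.asarray(y_pred) != np.asarray(y_true)).astype(int)

def error_rate(y_pred, y_true):
    """ERR_S: mean of the 0-1 loss over the set S."""
    return zero_one_loss(y_pred, y_true).mean()

def accuracy(y_pred, y_true):
    """ACC_S = 1 - ERR_S."""
    return 1.0 - error_rate(y_pred, y_true)

def r_square(y_pred, y_true):
    """R-square = 1 - SSR / SST, using the empirical mean of y_true."""
    y_pred = np.asarray(y_pred, dtype=float)
    y_true = np.asarray(y_true, dtype=float)
    ssr = np.sum((y_pred - y_true) ** 2)          # sum of squared residuals
    sst = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    return 1.0 - ssr / sst

# Toy classification example: one mistake out of five predictions.
acc = accuracy([0, 1, 0, 0, 1], [0, 1, 1, 0, 1])

# Toy regression example: SSR = 1 and SST = 5.
r2 = r_square([1.0, 2.0, 3.0, 5.0], [1.0, 2.0, 3.0, 4.0])

print(acc, r2)
```

Both toy examples give a score of 0.8 (up to floating-point rounding): one error in five predictions for the classifier, and $1 - 1/5$ for the regression.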

### 1.3. Resubstitution validation and holdout method (train-test split)
In previous applications and exercises, we have used the holdout method extensively. It consists in separating the available data into two subsets: the training set, on which we train the model with the learning algorithm, and the test set, which we use to measure the generalization performance of the trained model. We have done this using the function ```sklearn.model_selection.train_test_split```.
This evaluation method helps us avoid a major bias: the performance of the model on the training set may well be an overly optimistic measure of the general performance of the final model. If the model simply memorizes the associations between training observations and training targets without actually learning anything, the training performance will be perfect while the general performance will probably turn out to be awful.
Evaluating a model on the data that was used to train it is called resubstitution validation (or resubstitution evaluation).
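
As a rough illustration, the sketch below compares the resubstitution accuracy with the holdout accuracy of an unconstrained decision tree. The iris dataset, the 30% test size and the random seeds are arbitrary choices made for this example.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Holdout: keep 30% of the observations aside as a test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# A tree with no depth limit can memorize the training set.
model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

resub_acc = model.score(X_train, y_train)   # evaluated on the training data
holdout_acc = model.score(X_test, y_test)   # evaluated on unseen data

print(f"resubstitution accuracy: {resub_acc:.3f}")
print(f"holdout accuracy:        {holdout_acc:.3f}")
```

The resubstitution score is typically at or near 1.0 here, which says more about the tree's capacity to memorize than about its ability to generalize.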

### 1.4 Stratification
A dataset is nothing more than a random sample drawn from a probability distribution, and we typically assume that this sample is representative of the true population. Subsampling the dataset without replacement alters the distribution of the data (the mean, the variance, and the class proportions), and the smaller the dataset, the larger this effect tends to be. Subsampling without replacement is what we do when we split the data between train and test set: it needs to be without replacement, otherwise some observations would be found in both the training set and the test set. This is known as a leak, and it would invalidate the performance indicators calculated on the test set. To verify this effect empirically, we could subsample without replacement the iris dataset from sklearn, which is relatively small, and measure the proportions of the target variable (the species) after the train test split.

In the worst case, an under-represented class is completely absent from the train set or the test set, making it impossible to train the model on this class or to measure its performance on it.
Stratification means that instead of splitting the dataset completely randomly, we split it randomly under the constraint that each class of the target variable appears in the subsets in the same proportion as in the full dataset. Stratification can be easily implemented using the ```stratify``` argument of the ```train_test_split``` function.
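
The iris experiment suggested above could look like the following sketch. The helper `class_proportions`, the test size and the seed are assumptions made for this illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)  # three classes with 50 observations each

def class_proportions(labels):
    """Share of each of the three iris classes in a label vector."""
    return np.bincount(labels, minlength=3) / len(labels)

# Purely random split: class proportions in the test set may drift.
_, _, _, y_test_random = train_test_split(X, y, test_size=0.3, random_state=0)

# Stratified split: class proportions of y are preserved in both subsets.
_, _, _, y_test_strat = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

print("full dataset:    ", class_proportions(y))
print("random split:    ", class_proportions(y_test_random))
print("stratified split:", class_proportions(y_test_strat))
```

With stratification, each species makes up one third of the test set, exactly as in the full dataset.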

### 1.5 Holdout Validation
The figure below provides a visual summary of the holdout method:
1. Randomly divide the dataset into a train set and a test set (the test set is considered new, unseen data and should be used only when measuring the generalization performance).
2. Pick a model and learning algorithm that we think could be a good predictor for the target variable. In this step an additional loop may be put in place to optimize the model's hyperparameters (this can be done in practice with ```sklearn.model_selection.GridSearchCV```).
3. Estimate the general performance of the trained model using data from the test set (in practice, using the ```.score()``` method).
4. We now have an estimate of how well our model performs on unseen data. We can then re-train the chosen model on all available data; this generally improves the general performance of the model compared to the one measured with the test set, even though there is no practical way to measure it.

![holdout](https://full-stack-bigdata-datasets.s3.eu-west-3.amazonaws.com/Machine%20Learning%20Supervis%C3%A9/model_evaluation/holdout_method.PNG)


### 1.6 Pessimistic Bias
The first flaw of the holdout method is that it changes the distribution of the data because of sampling without replacement, which can be partly fixed through stratification.
The second issue has to do with step four of the holdout validation method. Since we can only estimate the general performance of the model trained on the training set, there is no way to evaluate the general performance of the final model trained on the full dataset. We should be aware that our estimate of the generalization performance is pessimistically biased because the model is trained on a subsample of the full dataset, and the fewer observations are available, the more pessimistic this bias becomes.
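
One way to get a feeling for this effect is to look at how the test accuracy evolves as the training set grows. The sketch below uses sklearn's `learning_curve` helper; the dataset, the model and the training sizes are arbitrary choices for illustration, and on such a small dataset the trend can be noisy.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import learning_curve
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# For each training size, the model is fitted on a subset of the training
# folds and scored on the held-out fold (5-fold cross-validation).
train_sizes, train_scores, test_scores = learning_curve(
    DecisionTreeClassifier(max_depth=3, random_state=0),
    X,
    y,
    train_sizes=np.linspace(0.2, 1.0, 5),
    cv=5,
    shuffle=True,
    random_state=0,
)

for size, scores in zip(train_sizes, test_scores):
    print(f"{int(size):3d} training observations -> "
          f"mean test accuracy {scores.mean():.3f}")
```

If the curve is still rising at the largest training size, a model re-trained on all available data would likely do better than the holdout estimate suggests.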

### 1.7 Confidence intervals via normal approximation
The accuracy we calculate to evaluate our model's general performance is nothing more than a proportion of correct predictions. It is therefore possible to calculate a confidence interval around this value using the formula for the confidence interval of a proportion:
$$\hat{p} \pm z_{1-\frac{\alpha}{2}} \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$$
where $\hat{p}$ is the estimated proportion (the interval aims to cover the true proportion $p$), $z_{1-\frac{\alpha}{2}}$ is the $1-\frac{\alpha}{2}$ quantile of the $N(0,1)$ normal distribution, $1-\alpha$ is the confidence level of the interval, and $n$ is the number of observations.
When applied to the model's accuracy on the test set, this gives:
$$ACC_S \pm z_{1-\frac{\alpha}{2}} \sqrt{\frac{ACC_S(1-ACC_S)}{n}}$$
In practice, however, it is recommended to repeat the train test split multiple times, measure the test performance each time, and calculate the confidence interval from the distribution obtained.
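
As a worked example, the sketch below computes the normal-approximation interval, centered on the measured accuracy with half-width $z_{1-\frac{\alpha}{2}} \sqrt{ACC_S(1-ACC_S)/n}$. The function name and the numbers (90% accuracy measured on 200 test observations) are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def accuracy_confidence_interval(acc, n, alpha=0.05):
    """Normal-approximation interval for an accuracy measured on n observations.

    alpha is the risk level, so the confidence level is 1 - alpha.
    """
    z = NormalDist().inv_cdf(1 - alpha / 2)      # quantile of N(0, 1)
    half_width = z * sqrt(acc * (1 - acc) / n)
    return acc - half_width, acc + half_width

# Hypothetical result: 90% accuracy measured on a test set of 200 observations.
low, high = accuracy_confidence_interval(acc=0.90, n=200)
print(round(low, 3), round(high, 3))  # 0.858 0.942
```

The approximation becomes unreliable when $n$ is small or when the accuracy is very close to 0 or 1.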

### 1.8 Repeated Hold out method
#### Motivation
Any model's error can be decomposed into three terms: squared bias $b^2$, variance $\sigma^2$, and noise $\epsilon^2$.
$$Error = b^2 + \sigma^2 + \epsilon^2$$
The noise cannot be predicted: it is by definition something that the model cannot and should not try to predict, because it is completely random (think of it as an unpredictable measurement imprecision when drawing observations from the population to build a dataset; for example, there could be a slight unpredictable error of mean 0 on petal length because of the imprecision of the measurement method).
The bias corresponds to the systematic deviation of the model's average prediction from the true mean of the target variable in the population: a biased model is always a little offset from its target. A low bias characterises an accurate model.
The variance measures how much the model's predictions change when the data used to fit it changes slightly; it is the variance of the predictions. A low variance characterises a precise model.
The figure below illustrates these ideas by displaying the four possible combinations of accurate/not accurate and precise/not precise.

![bias_variance](https://full-stack-bigdata-datasets.s3.eu-west-3.amazonaws.com/Machine%20Learning%20Supervis%C3%A9/model_evaluation/bias_variance.PNG)

Remember that the bias and variance of a model tend to move in opposite directions: lowering the variance usually increases the bias, and vice versa.
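
The trade-off can be observed in a small simulation. The sketch below is built entirely on assumptions made for illustration: a sine target function, Gaussian noise, and polynomial models of degree 1 (rigid) and degree 7 (flexible), each refitted on many freshly drawn training samples.

```python
import numpy as np

rng = np.random.default_rng(0)
x_grid = np.linspace(-0.9, 0.9, 50)   # points where predictions are compared
noise_sd, n_obs, n_repeats = 0.3, 30, 200

def target(x):
    """Assumed 'true' function f for this simulation."""
    return np.sin(np.pi * x)

def bias2_and_variance(degree):
    """Estimate squared bias and variance of a polynomial fit of given degree."""
    preds = np.empty((n_repeats, x_grid.size))
    for r in range(n_repeats):
        # Draw a fresh noisy training sample and fit the polynomial on it.
        x = rng.uniform(-1, 1, n_obs)
        y = target(x) + rng.normal(0, noise_sd, n_obs)
        preds[r] = np.polyval(np.polyfit(x, y, degree), x_grid)
    # Squared gap between the average prediction and the target function.
    bias2 = np.mean((preds.mean(axis=0) - target(x_grid)) ** 2)
    # Spread of the predictions across training samples.
    variance = np.mean(preds.var(axis=0))
    return bias2, variance

bias2_simple, var_simple = bias2_and_variance(degree=1)
bias2_flexible, var_flexible = bias2_and_variance(degree=7)

print(f"degree 1: bias^2 = {bias2_simple:.4f}, variance = {var_simple:.4f}")
print(f"degree 7: bias^2 = {bias2_flexible:.4f}, variance = {var_flexible:.4f}")
```

In this setting the straight line has the larger squared bias and the degree-7 polynomial has the larger variance.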

With that in mind, note that bias and variance do not only come from the model; they also stem from the data. Remember that all machine learning models are built on the assumption that the dataset we are working with (whether the full dataset, the training set or the test set) is representative of the true distribution in the population. As discussed before, splitting the data between a train set and a test set, which is subsampling without replacement, alters the distribution of the data, and the smaller the original sample, the stronger the alteration. The figure below illustrates this idea by simulating the same distribution with different numbers of observations.

![subsample](https://full-stack-bigdata-datasets.s3.eu-west-3.amazonaws.com/Machine%20Learning%20Supervis%C3%A9/model_evaluation/subsampling_effect.PNG)

We see here that the distributions in the test set and training set are not at all representative of the original distribution; it would take an enormous number of observations to counter this effect. As a consequence, the general performance estimates computed on models built from these subsamples would be strongly pessimistically biased.

#### Repeated Hold out
To tackle this difficulty, we can rely on a method called repeated holdout, which simply consists in repeating the holdout method a number of times with different random splits and averaging the performance indicators obtained in each round. This average should be more robust than an estimate calculated from a single holdout split.
This method is also known as Monte-Carlo cross-validation.
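
A minimal sketch of repeated holdout with sklearn is shown below. It relies on `StratifiedShuffleSplit`, which generates independent stratified random train/test splits; the number of repetitions, the test size and the model are arbitrary choices for this example.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedShuffleSplit, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# 50 independent stratified train/test splits, each holding out 30% of the data.
splitter = StratifiedShuffleSplit(n_splits=50, test_size=0.3, random_state=0)

# One holdout accuracy per split.
scores = cross_val_score(
    DecisionTreeClassifier(max_depth=3, random_state=0), X, y, cv=splitter
)

print(f"mean accuracy over {len(scores)} holdout rounds: {scores.mean():.3f}")
print(f"standard deviation across rounds: {scores.std():.3f}")
```

The standard deviation across rounds gives an idea of how much a single holdout estimate depends on the particular split.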

## 2. Bootstrap method
Monte-Carlo cross-validation helps us assess the stability of a model's general performance through an average. We will now discuss the bootstrap method, which allows us to measure the uncertainty around a given model's general performance.
The bootstrap method can be broken down into four steps:

1. We are given a dataset of size n.

2. For each of the $b$ bootstrap rounds, indexed by $j$:
* We draw one single observation from this dataset and assign it to the $j^{th}$ bootstrap sample.
* We repeat this step until the bootstrap sample has size n, the size of the original dataset.
* Each time, we draw from the same original dataset with replacement, so that some observations may
appear more than once in a bootstrap sample and some not at all.

3. We fit a model to each of the $b$ bootstrap samples and compute either the resubstitution accuracy or the accuracy on the observations that are not in the bootstrap sample, also called the out-of-bag accuracy (or R-square or MSE for regression).

4. We compute the model accuracy as the average over the $b$ accuracy estimates.

The main issue with the bootstrap resubstitution accuracy is its extremely optimistic bias. The out-of-bag accuracy, on the contrary, is pessimistically biased, and in some cases may rely on only a handful of observations, which makes it more unstable.
Let $ACC_{r,i}$ denote the resubstitution accuracy for the $i^{th}$ bootstrap sample, and $ACC_{h,i}$ the out-of-bag accuracy for the $i^{th}$ bootstrap sample.
Bradley Efron, an American statistician, proposed an accuracy estimator that balances the optimistic bias of the resubstitution accuracy against the pessimistic bias of the out-of-bag accuracy, with the following formula:
$$ACC_{bootstrap} = \frac{1}{b}\sum_{i=1}^{b}(0.632\times ACC_{h,i} + 0.368\times ACC_{r,i})$$
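
The bootstrap steps and the weighted estimator above can be sketched as follows. The dataset, the model, the number of rounds $b = 100$ and the variable names are assumptions made for this illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
n = len(y)
b = 100                              # number of bootstrap rounds
rng = np.random.default_rng(0)

acc_632 = []
for _ in range(b):
    # Draw n indices with replacement to build the bootstrap sample.
    boot_idx = rng.integers(0, n, size=n)

    # Out-of-bag observations are those never drawn in this round.
    oob_mask = np.ones(n, dtype=bool)
    oob_mask[boot_idx] = False

    model = DecisionTreeClassifier(max_depth=3, random_state=0)
    model.fit(X[boot_idx], y[boot_idx])

    acc_r = model.score(X[boot_idx], y[boot_idx])   # resubstitution accuracy
    acc_h = model.score(X[oob_mask], y[oob_mask])   # out-of-bag accuracy
    acc_632.append(0.632 * acc_h + 0.368 * acc_r)

acc_bootstrap = float(np.mean(acc_632))
print(f"0.632 bootstrap accuracy estimate: {acc_bootstrap:.3f}")
```

The spread of the `acc_632` values across rounds can also be inspected to get a sense of the uncertainty around the estimate.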

## 3. Cross-Validation and Hyperparameter Optimization
We have used cross-validation extensively in past exercises. This section adds a second layer to what you already know and consolidates your understanding of cross-validation.
We will focus here on k-fold cross-validation, as it is the most commonly used cross-validation method and is readily available in sklearn.

### 3.1. k-fold cross-validation for model evaluation
The k-fold cross-validation method consists in splitting the dataset into k parts (folds). Each part in turn plays the role of the test set while the other k-1 parts are brought together to form the training set. The model is trained for each of these k configurations and its performance is measured on the k corresponding test sets. The average of these k scores is the k-fold cross-validation estimate of the general performance.

![kfoldxval_eval](https://full-stack-bigdata-datasets.s3.eu-west-3.amazonaws.com/Machine%20Learning%20Supervis%C3%A9/model_evaluation/kfoldXval.PNG)
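
To show the mechanics, the sketch below writes the k-fold loop explicitly with sklearn's `KFold` splitter. The choice of $k = 5$, the model and the dataset are arbitrary; shuffling is enabled because the iris observations are sorted by class.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

k = 5
kfold = KFold(n_splits=k, shuffle=True, random_state=0)

fold_scores = []
for train_idx, test_idx in kfold.split(X):
    # Each fold plays the role of the test set exactly once.
    model = DecisionTreeClassifier(max_depth=3, random_state=0)
    model.fit(X[train_idx], y[train_idx])
    fold_scores.append(model.score(X[test_idx], y[test_idx]))

cv_estimate = float(np.mean(fold_scores))
print("accuracy per fold:", np.round(fold_scores, 3))
print(f"k-fold estimate:   {cv_estimate:.3f}")
```

In everyday use, `sklearn.model_selection.cross_val_score` wraps this loop in a single call.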

### 3.2. Model Selection via k-fold cross validation
Using k-fold cross-validation for model selection differs slightly from using it for evaluation. Here is how it works:

1. Similar to the holdout method, we split the dataset into two parts, a training and an
independent test set; we tuck away the test set for the final model evaluation step at the end (Step 4).

2. In this second step, we can now experiment with various hyperparameter settings; we could
use grid search, for example. For each hyperparameter
configuration, we apply the k-fold cross-validation method on the training set, resulting in multiple
models and performance estimates.

3. Taking the hyperparameter settings that produced the best results in the k-fold cross-validation procedure, we can then use the complete training set for model fitting with these settings.

4. Now, we use the independent test set that we withheld earlier (Step 1); we use this test set
to evaluate the model that we obtained from Step 3.

5. Finally, after we have completed the evaluation stage, we can optionally fit a model to all data
(training and test datasets combined), which could be the model used for deployment.

The following figure illustrates this process:

![xvalmodelselection](https://full-stack-bigdata-datasets.s3.eu-west-3.amazonaws.com/Machine%20Learning%20Supervis%C3%A9/model_evaluation/cross_val_model_selction.PNG)
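
The five steps can be sketched with `GridSearchCV` as follows. The decision tree, the `max_depth` grid, the number of folds and the split size are illustrative assumptions.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Step 1: set an independent test set aside.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# Step 2: 5-fold cross-validation on the training set for each configuration.
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": [1, 2, 3, 4, 5]},
    cv=5,
)

# Step 3: with the default refit=True, the best configuration is refitted
# on the complete training set at the end of fit().
search.fit(X_train, y_train)

# Step 4: evaluate the refitted model on the withheld test set.
test_acc = search.best_estimator_.score(X_test, y_test)

# Step 5 (optional): fit the selected configuration on all the data.
final_model = DecisionTreeClassifier(random_state=0, **search.best_params_)
final_model.fit(X, y)

print("best hyperparameters:", search.best_params_)
print(f"cross-validation accuracy: {search.best_score_:.3f}")
print(f"test accuracy:             {test_acc:.3f}")
```

Note that `best_score_` is the cross-validation score used for selection; the test accuracy is the figure to report as the generalization estimate.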

### 3.3 3-Way hold out method for Large Datasets
1. We start by splitting our dataset into three parts, a training set for model fitting, a validation
set for model selection, and a test set for the final evaluation of the selected model.
2. This step illustrates the hyperparameter tuning stage. We use the learning algorithm with
different hyperparameter settings (here: three) to fit models to the training data.
3. Next, we evaluate the performance of our models on the validation set. This step illustrates
the model selection stage; after comparing the performance estimates, we choose the hyperparameters
settings associated with the best performance. Note that we often merge steps two and three in
practice: we fit a model and compute its performance before moving on to the next model in order to
avoid keeping all fitted models in memory.
4. As discussed earlier, the performance estimates may suffer from pessimistic bias if the
training set is too small. Thus, we can merge the training and validation set after model selection and
use the best hyperparameter settings from the previous step to fit a model to this larger dataset.
5. Now, we can use the independent test set to estimate the generalization performance of our
model. Remember that the purpose of the test set is to simulate new data that the model has not
seen before. Re-using this test set may result in an overoptimistic bias in our estimate of the model's
generalization performance.
6. Finally, we can make use of all our data, merging the training and test sets, and fit a model to
all data points for real-world use.

The 3-way hold out method is illustrated in the following figure :
![3wayHoldout](https://full-stack-bigdata-datasets.s3.eu-west-3.amazonaws.com/Machine%20Learning%20Supervis%C3%A9/model_evaluation/3wayholdout.PNG)

When we browse the deep learning literature, we often find that the 3-way holdout method is
the method of choice when it comes to model evaluation; it is also common in older (non-deep
learning) literature. The three-way holdout may be preferred over k-fold
cross-validation since the former is computationally cheap in comparison. Aside from computational
efficiency concerns, we only use deep learning algorithms when we have relatively large sample
sizes anyway, scenarios where we do not have to worry so much about high variance (the sensitivity of our
estimates to how we split the dataset for training, validation, and testing). Thus, it is
fine to use the holdout method with a training, validation, and test split instead of k-fold cross-validation
for model selection if the dataset is relatively large.
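
A minimal sketch of the 3-way holdout steps is given below. The digits dataset stands in for a "large" dataset purely for convenience, and the split proportions and the three `max_depth` values are arbitrary assumptions.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_digits(return_X_y=True)

# Step 1: split into training, validation and test sets (roughly 60/20/20).
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, stratify=y_trainval, random_state=0
)

# Steps 2 and 3: fit one model per hyperparameter setting and score it
# on the validation set, keeping only the scores in memory.
val_scores = {}
for depth in (2, 5, 10):
    model = DecisionTreeClassifier(max_depth=depth, random_state=0)
    val_scores[depth] = model.fit(X_train, y_train).score(X_val, y_val)
best_depth = max(val_scores, key=val_scores.get)

# Step 4: refit the selected setting on training + validation data.
model = DecisionTreeClassifier(max_depth=best_depth, random_state=0)
model.fit(X_trainval, y_trainval)

# Step 5: estimate the generalization performance once, on the test set.
test_acc = model.score(X_test, y_test)

# Step 6: fit the final model on all the data for real-world use.
final_model = DecisionTreeClassifier(max_depth=best_depth, random_state=0)
final_model.fit(X, y)

print("validation accuracy per depth:", val_scores)
print(f"selected max_depth: {best_depth}, test accuracy: {test_acc:.3f}")
```

Each candidate model is fitted only once, which is what makes this approach cheaper than k-fold cross-validation.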

### 3.4 Normalization of data and cross-validation
We have learned in the past that any operation relying on a statistic of the data distribution must be fitted on the training set only, before applying the corresponding transformation to both the training set and the test set. This is the case for normalization, for example: the ```sklearn.preprocessing.StandardScaler``` transformer should always be fitted on the train set before transforming the training set and the test set.
During cross-validation, several train test splits occur, so the normalization should be re-fitted on the training part of each split in order to avoid leaking information from that split's test set into the train set. Such a leak could lead to overfitting and bias the evaluation of our models. A solution is to package all your preprocessing and your model in a pipeline before applying ```GridSearchCV``` to tune your hyperparameters.
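
A possible sketch of this pipeline approach is shown below. The step names `scaler` and `knn`, the k-nearest-neighbours model and the grid of values are choices made for this example.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# The scaler is part of the estimator, so it is re-fitted on the training
# part of every cross-validation split and never sees the validation fold.
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("knn", KNeighborsClassifier()),
])

# Hyperparameters of a pipeline step are addressed as <step name>__<parameter>.
search = GridSearchCV(pipe, param_grid={"knn__n_neighbors": [1, 3, 5, 7]}, cv=5)
search.fit(X_train, y_train)

test_acc = search.score(X_test, y_test)
print("best hyperparameters:", search.best_params_)
print(f"test accuracy: {test_acc:.3f}")
```

When `search.score` is called, the test set is transformed with the scaler fitted on the training set, so no statistic of the test data enters the preprocessing.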

