# Validation and Test

## Learning objectives

Understand the following terms:

- cross validation
- test set
- random seeding
- data leakage

# Cross validation

First of all, __what is a validation set?__ Simply put:

> a part (some percentage) of our training data that is held out and not used for training

Why would we hold it out? Isn't data precious, so that as many samples as possible should be used for training (as you said last time)?

Yes, but __you should always have at least a validation set (preferably a test set too, or even more splits, as we'll see in the next notebook)__. We will use the validation set to see how our models perform on unseen data __and choose the best one according to it__.

__Remember:__ Our end goal is to create algorithms that work on __new data__; __training on the training set is only a means to an end__.

## Test set

The story with the test set is a little bit different. It is also a part of the data left out of training, __but one we never use to make decisions about our algorithm__.

- It is purely informative: an approximation of how our model will do "in the wild"
- __In practice we may have no access to a real life test set__
- As it __cannot__ affect our decision process, we usually optimize the model based on the validation set only (or variations of it, which you will see later)

Say we trained our model and want it to run on questions users post on a website. The questions coming from those users will be our real test set.

__In this notebook we will use the test set__ the way one always should: just to check how our model does and whether it works at all. We should __never optimize according to it__.

## Sklearn split

We will train a linear regression model once again. Our focus will be on the data, not on the algorithm, as this splitting technique applies to most models you are going to encounter.

## Exercise

- Load the Boston dataset from `sklearn` (as previously)
- Use [`train_test_split`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) to split `X` and `y` into `X_train, X_test, y_train, y_test` (use `0.3` for `test_size`)
- Do it once more on `X_test` and `y_test` to get `X_test, X_validation, y_test, y_validation` (split the testing part of the data in half)

In [None]:
from sklearn import datasets
from sklearn.model_selection import train_test_split

# Note: load_boston was removed in scikit-learn 1.2; this cell needs an older release
X, y = datasets.load_boston(return_X_y=True)

print(f"Number of samples in dataset: {len(X)}")

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
X_test, X_validation, y_test, y_validation = train_test_split(
    X_test, y_test, test_size=0.5
)

print("Number of samples in:")
print(f"    Training: {len(y_train)}")
print(f"    Validation: {len(y_validation)}")
print(f"    Testing: {len(y_test)}")

Typically, 10-40% of the data is held out of training for the validation and test sets.

The less data we keep for validation, the less reliable our estimates will be; on the other hand, more training samples allow our algorithm to learn more.
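To get a feel for this trade-off, here is a minimal sketch that repeats the two-step split from the exercise for several held-out fractions. It uses random stand-in data rather than the notebook's dataset, and `three_way_split` is a hypothetical helper introduced only for this illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Random stand-in data (an assumption, not the notebook's dataset)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 4))
y_demo = rng.normal(size=500)


def three_way_split(X, y, holdout_fraction, seed=0):
    """Hold out a fraction of the data, then halve it into validation and test."""
    X_train, X_rest, y_train, y_rest = train_test_split(
        X, y, test_size=holdout_fraction, random_state=seed
    )
    X_val, X_test, y_val, y_test = train_test_split(
        X_rest, y_rest, test_size=0.5, random_state=seed
    )
    return (X_train, y_train), (X_val, y_val), (X_test, y_test)


for fraction in (0.1, 0.2, 0.4):
    train, val, test = three_way_split(X_demo, y_demo, fraction)
    print(
        f"held out {fraction:.0%}: train={len(train[1])}, "
        f"validation={len(val[1])}, test={len(test[1])}"
    )
```

Holding out more data shrinks the training set by the same amount, which is exactly the tension described above.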

## Performing validation

You probably know there are a lot of machine learning algorithms. How can we find out which one should be used for our problem?

Besides __domain knowledge about the specific problem__, cross-validation can always help us. Let's train a few algorithms and choose the best one based on validation set performance.

## Exercise

- Some setup is already in place, try to stick to it (or not, experimenting is always a good thing)
- Create all the models inside a list. Pass `splitter="random"` to `DecisionTreeRegressor`
- Inside the loop do the following:
    - fit the model on the training dataset
    - predict on training features
    - predict on validation features
    - predict on test features
    - calculate `mean_squared_error` on all of the predictions
- You can rewrite the `print` statement or leave it as it is; just remember to use the correct variable names in your loop!

In [None]:
# ML algorithms you will get to know later, don't panic
import numpy as np

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor

np.random.seed(2)

models = [DecisionTreeRegressor(splitter="random"), SVR(), LinearRegression()]

for model in models:
    model.fit(X_train, y_train)

    y_train_pred = model.predict(X_train)
    y_validation_pred = model.predict(X_validation)
    y_test_pred = model.predict(X_test)

    train_loss = mean_squared_error(y_train, y_train_pred)
    validation_loss = mean_squared_error(y_validation, y_validation_pred)
    test_loss = mean_squared_error(y_test, y_test_pred)

    print(
        f"{model.__class__.__name__}: "
        f"Train Loss: {train_loss} | Validation Loss: {validation_loss} | "
        f"Test Loss: {test_loss}"
    )

## Analysis

As you can see, the best validation loss belongs to linear regression, so this is the model we should choose. Unfortunately, it turns out that on "real" (test) data it performs worse than the decision tree.
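Choosing by validation loss can also be written out explicitly. The sketch below is only an illustration: it uses synthetic data from `make_regression` instead of the notebook's dataset, and the names `candidates` and `validation_losses` are made up for this example.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (an assumption, not the notebook's dataset)
X_demo, y_demo = make_regression(
    n_samples=300, n_features=5, noise=10.0, random_state=0
)
X_tr, X_val, y_tr, y_val = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0
)

candidates = {
    "tree": DecisionTreeRegressor(splitter="random", random_state=0),
    "linear": LinearRegression(),
}

# Fit every candidate on the training part, score it on the validation part
validation_losses = {}
for name, model in candidates.items():
    model.fit(X_tr, y_tr)
    validation_losses[name] = mean_squared_error(y_val, model.predict(X_val))

# The model with the lowest validation loss wins; the test set plays no role here
best_name = min(validation_losses, key=validation_losses.get)
print(f"Validation losses: {validation_losses}")
print(f"Selected model: {best_name}")
```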

Once again: usually we won't have the test loss available, but now you know this technique is imperfect (we will see how to mitigate these effects later on).
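One common way to depend less on a single lucky or unlucky split is k-fold cross-validation, where every sample is used for validation exactly once. The following is a minimal sketch on synthetic data; it is not necessarily the approach the later notebooks take.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data (an assumption, not the notebook's dataset)
X_demo, y_demo = make_regression(
    n_samples=200, n_features=5, noise=10.0, random_state=0
)

# Five folds: each fold serves once as the validation part
folds = KFold(n_splits=5, shuffle=True, random_state=0)

# scikit-learn scorers are "higher is better", so MSE is exposed negated
neg_mse = cross_val_score(
    LinearRegression(), X_demo, y_demo, cv=folds, scoring="neg_mean_squared_error"
)
fold_mse = -neg_mse

print(f"MSE per fold: {fold_mse}")
print(f"Mean MSE: {fold_mse.mean():.2f} (std {fold_mse.std():.2f})")
```

The spread across folds gives a rough idea of how much a single validation estimate can be off.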

## Seeding

Above we had this line: `np.random.seed(2)`. It is really important, and we should know what it does.

### Pseudo-random number generators

Many machine learning algorithms rely on randomness, for example random initialization of parameters before iterative training, or the random splits chosen by `DecisionTreeRegressor(splitter="random")` above. Depending on the algorithm, this may affect the result more or less severely.

- Each time you run an algorithm that relies on randomness, the result may vary to some degree
- Random number generators use a so-called `seed`, a numerical value that determines which values will be generated
- For every run to be the same (or to show some phenomenon, as we did above), we should __always__ seed all functions that use random numbers

The last point is pretty easy in `numpy` and `sklearn`, as it takes a single line. Most frameworks offer seeding of this kind.
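A small sketch of what seeding looks like in practice. It shows both the global `np.random.seed` used in this notebook and the `random_state` argument that `train_test_split` accepts; the toy array `data` is made up for the illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Same seed, same "random" numbers
np.random.seed(2)
first = np.random.rand(3)
np.random.seed(2)
second = np.random.rand(3)
print(np.array_equal(first, second))

# sklearn functions usually take a random_state argument for the same purpose
data = np.arange(20)
split_a = train_test_split(data, test_size=0.3, random_state=2)
split_b = train_test_split(data, test_size=0.3, random_state=2)
same_split = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print(same_split)
```

Note that when `random_state` is left unset, scikit-learn falls back to NumPy's global random state, so a global seed only helps if it is set before the random call happens.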

### Why seed?

- When you want your experiments to be reproducible (especially important in Machine Learning)
- To be sure the outcome does not change from run to run

So do it, it is good practice :) 
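To see how much a result can vary with the seed, the sketch below trains the same randomized model under several seeds and summarizes the validation loss. The data is synthetic and the seed range is an arbitrary choice made for this illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (an assumption, not the notebook's dataset)
X_demo, y_demo = make_regression(
    n_samples=300, n_features=5, noise=10.0, random_state=0
)
X_tr, X_val, y_tr, y_val = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0
)

# The split stays fixed; only the model's own randomness changes
losses = []
for seed in range(5):
    model = DecisionTreeRegressor(splitter="random", random_state=seed)
    model.fit(X_tr, y_tr)
    losses.append(mean_squared_error(y_val, model.predict(X_val)))

print(f"Validation MSE per seed: {np.round(losses, 2)}")
print(f"Mean: {np.mean(losses):.2f}, std: {np.std(losses):.2f}")
```

Reporting the mean over a few seeded runs keeps the experiment reproducible while making it less dependent on one particular seed.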

## Data Leakage

This is another deadly sin and you have to __avoid it at all costs__.

> Data leakage happens when validation (or test) data, or information derived from it, is used during training

### Data Leakage examples

- Normalizing images by computing the mean and variance across the __whole dataset__ instead of the __training dataset only__. This gives the model information about the mean and variance of the validation dataset, which may artificially boost its performance
- Bad data splitting - some samples are in both training and validation

Let's see the last example in action...

## Exercise

- Create a function `calculate_validation_loss` which does the following:
    - Creates a `LinearRegression` model
    - Fits it to the training data (`X_train` and `y_train`)
    - Predicts on the validation data and calculates MSE (mean squared error)
    - Prints the validation loss

Run the rest of this cell and see what results you observe.

In [None]:
def calculate_validation_loss(X_train, y_train, X_validation, y_validation):
    model = LinearRegression()

    # Fit on whatever training data is passed in, evaluate on the validation data
    model.fit(X_train, y_train)
    y_validation_pred = model.predict(X_validation)
    validation_loss = mean_squared_error(y_validation, y_validation_pred)

    print(f"Validation loss: {validation_loss}")
    
# Without data leakage, train on train, validate on validation
calculate_validation_loss(X_train, y_train, X_validation, y_validation)

# With data leakage, 50 samples from validation added
fail_X_train = np.concatenate((X_train, X_validation[:50]))
fail_y_train = np.concatenate((y_train, y_validation[:50]))

calculate_validation_loss(fail_X_train, fail_y_train, X_validation, y_validation)

As expected, the model saw part of the validation data during training, so it __falsely__ appears to perform better on it.
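The first leakage example listed above (normalization statistics) can be sketched in a similar way. This is a minimal illustration using `StandardScaler` on random stand-in data, not part of the original exercise: the scaler is fitted on the training rows only and then reused unchanged for validation.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Random stand-in features (an assumption, not the notebook's dataset)
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=5.0, scale=2.0, size=(200, 3))
X_tr, X_val = train_test_split(X_demo, test_size=0.3, random_state=0)

# Leaky: mean and variance computed over training AND validation rows
leaky_scaler = StandardScaler().fit(X_demo)

# Correct: statistics come from the training rows only...
scaler = StandardScaler().fit(X_tr)
X_tr_scaled = scaler.transform(X_tr)
# ...and are applied as-is to the validation rows
X_val_scaled = scaler.transform(X_val)

print("train-only mean:", scaler.mean_)
print("whole-data mean:", leaky_scaler.mean_)
```

The two sets of statistics are usually close, which is why this kind of leakage is easy to miss.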

## Challenges

- Change the seed value and see how this affects our results
- Write down at least one additional downside of using a validation set, based on the experiments above
- How do the results change with the `test_size` values passed to `train_test_split`? What happens if you give more/less data to the training/validation/test phase? Experiment and write a short report of a few sentences

## Summary

- The validation set is used to find the best algorithms, the best set of arguments for those algorithms, etc.
- The test set is used to check how our algorithm performs on unseen data
- __As we tune algorithms according to the `validation` dataset, we cannot use it to report final performance__
- `seed` is used to ensure reproducibility. Running an experiment several times is also a good idea if our code depends heavily on random initialization (we can take the mean of the results)
- Data leakage is information from `validation` (or `test`) leaking into training
- Data leakage leads to falsely good results and should be avoided
- Rule of thumb: imagine you only have the training dataset when doing preprocessing. Statistics must be computed from it alone and then applied unchanged to `validation` and `test`; nothing calculated from `validation` or `test` may be used