<table border="0" style="width:100%">
 <tr>
    <td>
        <img src="https://static-frm.ie.edu/university/wp-content/uploads/sites/6/2022/06/IE-University-logo.png" width=150>
     </td>
    <td><div style="font-family:'Courier New'">
            <div style="font-size:25px">
                <div style="text-align: right"> 
                    <b> MASTER IN BIG DATA</b>
                    <br>
                    Python for Data Analysis II
                    <br><br>
                    <em> Daniel Sierra Ramos </em>
                </div>
            </div>
        </div>
    </td>
 </tr>
</table>

# **S09: ROBUSTNESS - CROSS VALIDATION**

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.graph_objects as go
import plotly.io as pio
pio.templates.default = "none"

# The train-test split is not flawless

- When we tackle a machine learning problem, one of the most important operations we do is splitting the data into a **training set** and a **test set**.
    - Train Set - Used to train the model
    - Test Set - Used to evaluate the model
- We use this split to **guard against overfitting**: a model that learns the training data too closely does not generalize well to unseen data, and the test set is what reveals it.

**However, a single train-test split is too simple, and it is not enough.**

<center><img src="https://d33wubrfki0l68.cloudfront.net/c39b2d19183ed14141a8b7b03943442d40efee0d/81e2a/wp-content/uploads/2019/03/train_test_split.png" width="700"/></center>

### When *hyperparameters* come into play

- When we **adjust the hyperparameters of the model**, say the regularization strength or the depth of a Decision Tree, we inevitably tend to use the **test set** to *tweak the hyperparameters* until we get the best possible results on it.
> Be careful! Doing this turns the test set into **seen data** instead of **unseen data**. In the end, we're using the test set to train ("improve") the model.

- To solve this problem, we can set aside another subset, called the **validation set**, and use it to validate the hyperparameters. The test set is then used only once, to evaluate the final model on **real unseen data**.

<center><img src="https://cdn.shortpixel.ai/spai/q_lossy+w_730+to_webp+ret_img/https://algotrading101.com/learn/wp-content/uploads/2020/06/training-validation-test-data-set.png" width="700"/></center>
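
A minimal sketch of the three-way split described above, assuming the breast cancer dataset used later in this notebook and an illustrative 60/20/20 proportion (the proportions and `random_state` are assumptions, not a recommendation):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)

# Step 1: hold out the test set (20% of all the data).
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Step 2: carve the validation set out of the remainder
# (25% of the remaining 80% is roughly 20% of the total).
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0, stratify=y_rest
)

print(len(X_train), len(X_val), len(X_test))
```

Hyperparameters would be compared on `X_val`, and `X_test` would be touched only once at the very end.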


### Right, but now I have less data for training :(

- Having a validation set reduces the amount of data we have to train the model. This can compromise the model's performance, especially if we have a small dataset.

- Also, having just one validation set still leaves one problem unresolved: is my model robust if it is trained on different sets of data?

To solve these problems, we can use **cross-validation**.

# **Cross-Validation**: Building robust models

 - **Cross-validation** is a technique commonly used in machine learning to measure the robustness of a model and, in essence, to estimate its bias and variance. Have a look at this article: https://codesachin.wordpress.com/2015/08/30/cross-validation-and-the-bias-variance-tradeoff-for-dummies/

 - Also, the CV strategy is used for **hyperparameter tuning** in order to find the best set of hyperparameters for the model.

 - It consists of repeatedly holding out a different part of the training data as the **validation set**, so that the model's robustness is tested over several training runs with different data.

 - The most relevant cross validation strategies are **K-Fold CV** and **Leave-One-Out CV**.

 - **Cross-validation** has become a standard in the ML process, and **is part of the general set up in a machine learning model design**. So, it's important to know how to use it.

<center><img src="https://elitedatascience.com/wp-content/uploads/2017/06/Train-Test-Split-Diagram.jpg" width="700"/></center>
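
The hyperparameter-tuning use mentioned in the list above could be sketched as follows. This is only an illustration: the candidate values in `candidate_C` are hypothetical, and the `StandardScaler` pipeline is an assumption added to keep the linear SVM fast to fit. Note that only the training set is used; the test set stays untouched.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=0
)

candidate_C = [0.01, 0.1, 1, 10]  # hypothetical grid of regularization values
mean_scores = {}
for C in candidate_C:
    # Scaling inside the pipeline is refit on each training fold (assumption).
    model = make_pipeline(StandardScaler(), SVC(kernel="linear", C=C))
    scores = cross_val_score(model, X_train, y_train, cv=5, scoring="accuracy")
    mean_scores[C] = scores.mean()

# Keep the value with the best average validation score.
best_C = max(mean_scores, key=mean_scores.get)
print(mean_scores, best_C)
```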

### **K-Fold CV**

It consists of dividing the training data into $K$ folds and fitting $K$ models, rotating the data used for training. The typical setting is to use $K-1$ folds for training and the remaining one for validation, then repeat while rotating the validation fold until every fold has been used for validation once.

<center><img src="https://miro.medium.com/max/720/1*qPMFLEbvc8QQf38Cf77wQg.png" width="700"/></center>
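
The rotation can be seen directly with scikit-learn's `KFold` splitter. A small sketch on ten toy samples (the toy array `X_toy` is an assumption for illustration):

```python
import numpy as np
from sklearn.model_selection import KFold

X_toy = np.arange(10).reshape(-1, 1)  # 10 toy samples, one feature
kf = KFold(n_splits=5)

folds = []
for fold, (train_idx, val_idx) in enumerate(kf.split(X_toy)):
    # K-1 folds go to training, the remaining fold to validation.
    folds.append((train_idx, val_idx))
    print(f"Fold {fold}: train={train_idx}, validation={val_idx}")
```

Each sample lands in the validation fold exactly once across the five iterations.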

### **Leave-One-Out CV**

It's the extreme version of K-Fold. It consists of using $N-1$ samples for training and the remaining one for validation, then repeating while rotating the validation sample. That is, the number of folds is equal to the number of samples $N$ in the dataset.
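
A quick check of that equivalence, assuming a toy array of six samples: `LeaveOneOut` should produce the same splits as an unshuffled `KFold` with as many folds as samples.

```python
import numpy as np
from sklearn.model_selection import KFold, LeaveOneOut

X_toy = np.arange(6).reshape(-1, 1)  # 6 toy samples

loo_splits = [
    (train.tolist(), val.tolist()) for train, val in LeaveOneOut().split(X_toy)
]
kfold_splits = [
    (train.tolist(), val.tolist())
    for train, val in KFold(n_splits=len(X_toy)).split(X_toy)
]

for train, val in loo_splits:
    print(f"train={train}, validation={val}")
print("Same as K-Fold with K=N:", loo_splits == kfold_splits)
```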

## Cross-Validation in `scikit-learn`

### The `cross_val_score` function

Given a model, the `cross_val_score` function allows us to evaluate it using cross-validation. The model does not need to be trained beforehand, because a fresh copy is fitted on every split. The function takes as input the model, the data, the target, and the number of folds (`cv`), and returns the score of the model on each fold.

```python
from sklearn.model_selection import cross_val_score

scores = cross_val_score(model, X, y, cv=5)
```

In [40]:
import pandas as pd

from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.datasets import load_breast_cancer
from sklearn.svm import SVC

In [14]:
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=0)

Let's evaluate an SVM model using cross-validation.

In [15]:
# Define the model (a linear SVM); it is not fitted yet
clf = SVC(kernel='linear', C=1)

In [16]:
# Cross-validate on the training set (precision on each validation fold)
cross_val_score(clf, X_train, y_train, cv=5, scoring="precision")

array([0.93181818, 0.95348837, 0.97560976, 0.95348837, 0.93333333])

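Here, `cross_val_score` trains a fresh copy of the model on each group of $K-1$ folds and then computes the metric given by `scoring` (`precision` in this case) on the held-out fold. If `scoring` is omitted, the estimator's own `score` method is used, which is accuracy for classifiers.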
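
To make this concrete, here is a rough manual equivalent of the call above. It is a sketch, not scikit-learn's actual implementation, and it relies on the documented behaviour that an integer `cv` with a classifier uses an unshuffled `StratifiedKFold`.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import precision_score
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=0
)
clf = SVC(kernel="linear", C=1)

manual_scores = []
for train_idx, val_idx in StratifiedKFold(n_splits=5).split(X_train, y_train):
    fold_model = clone(clf)  # fresh, unfitted copy for every fold
    fold_model.fit(X_train[train_idx], y_train[train_idx])
    y_pred = fold_model.predict(X_train[val_idx])
    manual_scores.append(precision_score(y_train[val_idx], y_pred))

reference_scores = cross_val_score(clf, X_train, y_train, cv=5, scoring="precision")
print(np.allclose(manual_scores, reference_scores))
```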

Now, let's evaluate the model using the `recall` metric.

In [17]:
# Cross-validate on the training set (recall on each validation fold)
cross_val_score(clf, X_train, y_train, cv=5, scoring="recall")

array([0.95348837, 0.95348837, 0.95238095, 0.97619048, 1.        ])

Now, let's use the LOO strategy.

In [30]:
from sklearn.model_selection import LeaveOneOut, KFold

In [31]:
X_train.shape

(341, 30)

In [32]:
# LOO takes too long here because it has to fit 341 models. Let's reduce the dataset to 20 samples to illustrate the idea
X_train_red = X_train[:20]
y_train_red = y_train[:20]

In [34]:
cross_val_score(clf, X_train_red, y_train_red, cv=LeaveOneOut(), scoring="accuracy")

array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 0., 1., 0., 1., 1., 1.,
       0., 1., 1.])
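
Because every LOO validation set holds a single sample, each score is either 0 or 1, and only their average is informative. A small sketch using the scores printed above (copied by hand into `loo_scores`):

```python
import numpy as np

# Per-sample LOO accuracies copied from the output above
loo_scores = np.array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 0., 1., 0., 1., 1., 1.,
                       0., 1., 1.])

loo_accuracy = loo_scores.mean()  # fraction of held-out samples predicted correctly
print(f"LOO accuracy: {loo_accuracy:.2f} ({int(loo_scores.sum())}/{len(loo_scores)} correct)")
```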

### Let's rock with the `cross_validate` function.

The ``cross_validate`` function is similar to the ``cross_val_score`` function, but it allows us to use different metrics and to return the training time and training scores.

```python
from sklearn.model_selection import cross_validate

scores = cross_validate(model, X, y, cv=5, scoring=['precision', 'recall'], return_train_score=True)
```

In [35]:
from sklearn.model_selection import cross_validate

In [38]:
results = cross_validate(clf, X_train, y_train, cv=5, scoring=['precision', 'recall'], return_train_score=True)

results

{'fit_time': array([0.2752738 , 0.34882474, 0.17294383, 0.33057857, 0.20749354]),
 'score_time': array([0.00152445, 0.00146008, 0.00151134, 0.00153732, 0.00178981]),
 'test_precision': array([0.93181818, 0.95348837, 0.97560976, 0.95348837, 0.93333333]),
 'train_precision': array([0.96511628, 0.96511628, 0.97660819, 0.96551724, 0.98214286]),
 'test_recall': array([0.95348837, 0.95348837, 0.95238095, 0.97619048, 1.        ]),
 'train_recall': array([0.98224852, 0.98224852, 0.98235294, 0.98823529, 0.97058824])}

We can transform the dictionary into a dataframe and calculate the mean and standard deviation of the scores to evaluate the general performance of the model. The optimum result is:
- *Train and test scores are **very similar***
- *The average of the scores is **high***
- *The standard deviation of the scores is **low***

In [42]:
results = pd.DataFrame(results)
results

| | fit_time | score_time | test_precision | train_precision | test_recall | train_recall |
|---|---|---|---|---|---|---|
| 0 | 0.275274 | 0.001524 | 0.931818 | 0.965116 | 0.953488 | 0.982249 |
| 1 | 0.348825 | 0.00146 | 0.953488 | 0.965116 | 0.953488 | 0.982249 |
| 2 | 0.172944 | 0.001511 | 0.97561 | 0.976608 | 0.952381 | 0.982353 |
| 3 | 0.330579 | 0.001537 | 0.953488 | 0.965517 | 0.97619 | 0.988235 |
| 4 | 0.207494 | 0.00179 | 0.933333 | 0.982143 | 1.0 | 0.970588 |


In [43]:
results.mean()

fit_time           0.267023
score_time         0.001565
test_precision     0.949548
train_precision    0.970900
test_recall        0.967110
train_recall       0.981135
dtype: float64

In [44]:
results.std()

fit_time           0.076147
score_time         0.000129
test_precision     0.017941
train_precision    0.007982
test_recall        0.020930
train_recall       0.006435
dtype: float64
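
The three criteria listed earlier can be read from a single summary table. A minimal sketch, using the per-fold scores printed above (copied by hand) and a hypothetical `gaps` dictionary for the train-test difference:

```python
import pandas as pd

# Per-fold scores copied from the `cross_validate` output above
results = pd.DataFrame({
    "test_precision": [0.93181818, 0.95348837, 0.97560976, 0.95348837, 0.93333333],
    "train_precision": [0.96511628, 0.96511628, 0.97660819, 0.96551724, 0.98214286],
    "test_recall": [0.95348837, 0.95348837, 0.95238095, 0.97619048, 1.0],
    "train_recall": [0.98224852, 0.98224852, 0.98235294, 0.98823529, 0.97058824],
})

# One row per score column, with its mean and standard deviation across folds
summary = results.agg(["mean", "std"]).T
print(summary)

# Gap between average train and validation scores (small gap = little overfitting)
gaps = {
    metric: summary.loc[f"train_{metric}", "mean"] - summary.loc[f"test_{metric}", "mean"]
    for metric in ["precision", "recall"]
}
print(gaps)
```

Here the averages are high, the standard deviations are around 0.02 or lower, and the train scores exceed the validation scores by only about 0.01 to 0.02.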

## Other cross-validation strategies

### **Stratified K-Fold**

Plain K-Fold CV is not always the best option. For example, if the dataset has a strong class imbalance, some folds may contain very few samples of the minority class, or none at all, so the scores for that class become unreliable. To solve this problem, we can use **Stratified K-Fold CV**. This strategy ensures that the proportion of each class is approximately the same in every fold.

```python
from sklearn.model_selection import StratifiedKFold

skf = StratifiedKFold(n_splits=5)
```
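
A small sketch of the difference, assuming toy labels with 90 negatives followed by 10 positives (the arrays are illustrative only). It counts how many positives end up in each validation fold:

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Toy imbalanced labels: 90 negatives followed by 10 positives
y_toy = np.array([0] * 90 + [1] * 10)
X_toy = np.zeros((100, 1))  # features are irrelevant for the split itself

kf_positives = [
    int(y_toy[val_idx].sum()) for _, val_idx in KFold(n_splits=5).split(X_toy)
]
skf_positives = [
    int(y_toy[val_idx].sum())
    for _, val_idx in StratifiedKFold(n_splits=5).split(X_toy, y_toy)
]

print("KFold positives per fold:          ", kf_positives)
print("StratifiedKFold positives per fold:", skf_positives)
```

With these sorted toy labels, unshuffled `KFold` places all ten positives in the last fold, while `StratifiedKFold` spreads them evenly across the folds.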

### **Group K-Fold**

Group K-Fold CV is used when the samples in the dataset belong to groups. For example, if we have several records per patient and we want to predict each patient's disease progression, we can use Group K-Fold CV to ensure that all the data from a given patient falls either in the training set or in the validation set, never in both. Otherwise, the model would be validated on patients it has already seen, and the scores would be too optimistic.

```python
from sklearn.model_selection import GroupKFold

gkf = GroupKFold(n_splits=5)
```
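
A minimal sketch, assuming hypothetical patient IDs as groups (five patients with three records each), that checks that `GroupKFold` never places the same group on both sides of a split:

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Hypothetical data: 5 patients, 3 records per patient
groups = np.repeat(np.arange(5), 3)
X_toy = np.arange(15).reshape(-1, 1)
y_toy = np.zeros(15)

gkf = GroupKFold(n_splits=5)
overlaps = []
for train_idx, val_idx in gkf.split(X_toy, y_toy, groups=groups):
    # Patients that appear in both the training and the validation indices
    shared = set(groups[train_idx]) & set(groups[val_idx])
    overlaps.append(shared)
    print(f"validation patients={sorted(set(groups[val_idx]))}, shared={shared}")
```

The `groups` array is passed at split time, so the same argument would also need to be given to `cross_val_score` or `cross_validate` through their `groups` parameter.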

### **Time Series Split**

The Time Series Split CV is used when we have a time series dataset. For example, if we have a dataset with the stock price of a company, we can use the Time Series Split CV to ensure that the model is trained with data from the past and validated with data from the future.

```python
from sklearn.model_selection import TimeSeriesSplit

tscv = TimeSeriesSplit(n_splits=5)
```
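
A short sketch on twelve toy time steps (an illustrative array, ordered from oldest to newest) showing that every training window ends before its validation window begins:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X_toy = np.arange(12).reshape(-1, 1)  # 12 observations in chronological order

tscv = TimeSeriesSplit(n_splits=5)
splits = list(tscv.split(X_toy))
for train_idx, val_idx in splits:
    # The training window grows over time; validation always comes after it.
    print(f"train={train_idx}, validation={val_idx}")
```

Unlike K-Fold, the data is never shuffled and later observations are never used to predict earlier ones.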

## What's next?

Once we have evaluated the model using cross-validation, we need to **train** the final model with the **entire training set** and **evaluate** it with the **test set**. This is the final step of the ML process.
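
As a sketch of that last step, reusing the same split and linear SVM as in the examples above (the choice of `C=1` simply mirrors the earlier cells and is not the result of any tuning):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=0
)

# Fit the final model on the entire training set...
final_model = SVC(kernel="linear", C=1)
final_model.fit(X_train, y_train)

# ...and evaluate it once on the untouched test set.
y_pred = final_model.predict(X_test)
test_precision = precision_score(y_test, y_pred)
test_recall = recall_score(y_test, y_pred)
print(f"Test precision: {test_precision:.3f}, test recall: {test_recall:.3f}")
```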