# **S09: ROBUSTNESS - CROSS VALIDATION**

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.graph_objects as go
import plotly.io as pio
pio.templates.default = "none"

# The train-test split is not flawless

- When we tackle a machine learning problem, one of the most important operations we do is splitting the data into a **training set** and a **test set**.
    - Train Set - Used to train the model
    - Test Set - Used to evaluate the model
- We use this split to **guard against overfitting**, that is, to check that the model has not learned the training data so closely that it fails to generalize to unseen data.

**However, the train-test split is too simple and it's not enough**

<center><img src="https://d33wubrfki0l68.cloudfront.net/c39b2d19183ed14141a8b7b03943442d40efee0d/81e2a/wp-content/uploads/2019/03/train_test_split.png" width="700"/></center>
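As a minimal sketch of the split described above, the snippet below trains a model on the training set and scores it on both sets. The choice of `DecisionTreeClassifier` and of the breast cancer dataset is an assumption made only for illustration; any estimator would do.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Hold out 25% of the samples as the test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# An unconstrained tree (illustrative choice) can memorize the training data
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

train_acc = tree.score(X_train, y_train)  # accuracy on data the model has seen
test_acc = tree.score(X_test, y_test)     # accuracy on held-out data
print(train_acc, test_acc)
```

A training score clearly above the test score is the usual symptom of overfitting.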

### When *hyperparameters* come into play

- When we **adjust the hyperparameters of the model**, say the regularization strength or the maximum depth of a Decision Tree, we inevitably tend to use the **test set** to *tweak the hyperparameters* until we get the best possible results on the test set.
> Be careful! Doing this automatically turns the test set into **seen data** instead of **unseen data**. In the end, we're using the test set to train ("improve") the model.

- To solve this problem, we can set aside another subset, called the **validation set**, to validate the hyperparameters. Then we use the test set to evaluate the model on **truly unseen data**.

<center><img src="https://cdn.shortpixel.ai/spai/q_lossy+w_730+to_webp+ret_img/https://algotrading101.com/learn/wp-content/uploads/2020/06/training-validation-test-data-set.png" width="700"/></center>
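One common way to obtain the three sets is to call `train_test_split` twice. The sketch below uses made-up toy data and illustrative proportions (60/20/20); both are assumptions, not a recommendation.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 100 samples, 2 features (made up for illustration)
X = np.arange(200).reshape(100, 2)
y = np.arange(100) % 2

# First split off the test set (20% of all samples)
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Then carve the validation set out of the remainder (25% of 80 = 20 samples)
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, random_state=0
)

print(X_train.shape, X_val.shape, X_test.shape)  # (60, 2) (20, 2) (20, 2)
```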


### Right, but now I have less data for training :(

- Having a validation set reduces the amount of data we have to train the model. This can compromise the model's performance, especially if we have a small dataset.

- Also, having just one validation set still leaves one problem unresolved: is my model robust when trained on different sets of data?

To solve all these problems, we can use **cross-validation**.

# **Cross-Validation**: Building robust models

 - **Cross-validation** (CV) is a technique commonly used in machine learning to measure the robustness of a model and, in essence, to estimate its bias and variance. Have a look at this article: https://codesachin.wordpress.com/2015/08/30/cross-validation-and-the-bias-variance-tradeoff-for-dummies/

 - Also, the CV strategy is used for **hyperparameter tuning** in order to find the best set of hyperparameters for the model.

 - It consists of repeatedly holding out part of the training data as a **validation set**, in order to test the model's robustness across several training settings with different data.

 - The most relevant cross validation strategies are **K-Fold CV** and **Leave-One-Out CV**.

 - **Cross-validation** has become a standard in the ML process, and **is part of the general set up in a machine learning model design**. So, it's important to know how to use it.

<center><img src="https://elitedatascience.com/wp-content/uploads/2017/06/Train-Test-Split-Diagram.jpg" width="700"/></center>

### **K-Fold CV**

K-Fold CV consists of dividing the training data into $K$ folds and training $K$ models, rotating the data used for training. The typical setting is to use $K-1$ folds for training and the remaining one for validation, then repeat until every fold has served as the validation set once.

<center><img src="https://miro.medium.com/max/720/1*qPMFLEbvc8QQf38Cf77wQg.png" width="700"/></center>
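The rotation described above can be inspected directly with scikit-learn's `KFold` splitter. This is a small sketch on made-up toy data (10 samples, 5 folds), printing which indices land in the training and validation parts of each fold.

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(20).reshape(10, 2)  # 10 toy samples, 2 features

kf = KFold(n_splits=5)
folds = list(kf.split(X))  # list of (train_indices, validation_indices)

for fold, (train_idx, val_idx) in enumerate(folds):
    print(f"fold {fold}: train={train_idx}, validation={val_idx}")
```

Each sample appears in the validation part exactly once and in the training part of the other $K-1$ folds.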

### **Leave-One-Out CV**

It's the extreme version of K-Fold: the number of folds equals the number of samples $N$ in the dataset. It consists of using $N-1$ samples for training and the remaining one for validation, then repeating until every sample has been used for validation once.
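To make the relationship with K-Fold concrete, this sketch (on 6 made-up samples) checks that `LeaveOneOut` produces one split per sample and that its validation indices match those of `KFold` with `n_splits` equal to the number of samples.

```python
import numpy as np
from sklearn.model_selection import KFold, LeaveOneOut

X = np.arange(12).reshape(6, 2)  # 6 toy samples

loo = LeaveOneOut()
n_loo = loo.get_n_splits(X)
print(n_loo)  # 6, one split per sample

# KFold with as many folds as samples holds out one sample at a time too
kf = KFold(n_splits=len(X))
same = all(
    np.array_equal(val_loo, val_kf)
    for (_, val_loo), (_, val_kf) in zip(loo.split(X), kf.split(X))
)
print(same)  # True
```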

## Cross-Validation in `scikit-learn`

### The `cross_val_score` function

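Given a model (an estimator that does not need to be fitted beforehand), the `cross_val_score` function evaluates it using cross-validation: it clones and fits the model on each training split. It takes as input the model, the features, the target, and the cross-validation strategy (for example, the number of folds via `cv`). It returns the score of the model on each fold.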

```python
from sklearn.model_selection import cross_val_score

scores = cross_val_score(model, X, y, cv=5)
```

In [24]:
import pandas as pd

from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.datasets import load_breast_cancer
from sklearn.svm import SVC

In [44]:
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

Let's evaluate an SVM model using cross-validation.

In [45]:
y_train

array([1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 0, 0, 1, 1, 0, 1,
       0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 1,
       1, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 0,
       1, 1, 1, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1,
       1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0,
       1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0,
       0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 1,
       1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0,
       1, 0, 0, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1,
       1, 0, 1, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 1,
       1, 1, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 1, 1, 1, 0, 0, 1, 1,
       1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1,
       0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0,

In [46]:
X_train.shape

(426, 30)

In [29]:
X_test.shape

(143, 30)

In [30]:
# Define the model to evaluate: a linear-kernel SVM with C=1
clf = SVC(kernel='linear', C=1)

Each actual value is either *disease* or *not disease*, and so is each prediction.

confusion matrix:

|  | predicted positive | predicted negative |
|---|---|---|
| real positive | tp | fn |
| real negative | fp | tn |

precision = tp / (tp + fp)

recall = tp / (tp + fn)

accuracy = (tp + tn) / (tp + tn + fp + fn)

|  | predicted positive | predicted negative |
|---|---|---|
| real positive | 90 | 2 |
| real negative | 3 | 5 |

accuracy = (90 + 5) / (90 + 5 + 2 + 3) = 95 / 100 = 0.95



|  | predicted positive | predicted negative |
|---|---|---|
| real positive | 94 | 2 |
| real negative | 3 | 1 |

accuracy = (94 + 1) / (94 + 1 + 2 + 3) = 95 / 100 = 0.95
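The formulas above can be checked against the two example tables with a few lines of Python. The helper name `metrics` is hypothetical; the counts are read from the tables as tp, fn (first row) and fp, tn (second row).

```python
def metrics(tp, fn, fp, tn):
    """Return (accuracy, precision, recall) from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return accuracy, precision, recall

# Counts taken from the two example tables
first = metrics(tp=90, fn=2, fp=3, tn=5)
second = metrics(tp=94, fn=2, fp=3, tn=1)

for name, (acc, prec, rec) in [("first", first), ("second", second)]:
    print(f"{name}: accuracy={acc:.3f} precision={prec:.3f} recall={rec:.3f}")
```

Both tables give an accuracy of 0.95, even though the second one gets only 1 of its 4 real negatives right. This is why it helps to look at more than one metric.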

In [50]:
# Train and evaluate on the training set using 10-fold cross-validation
results = cross_val_score(clf, X_train, y_train, cv=10, scoring="precision")

print(np.mean(results))
print(np.std(results))

0.9522385022385024
0.028755205234580215


In [55]:
import plotly.express as px

px.histogram(results)

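Because we passed `scoring="precision"`, `cross_val_score` trains the model on each set of training folds and computes the precision on the corresponding held-out fold. Without a `scoring` argument, it falls back to the estimator's own `score` method, which is accuracy for classifiers such as `SVC`.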

Now, let's evaluate the model using the `recall` metric, and changing the number of folds to 5.

In [56]:
# recall = tp / (tp + fn)

# Train and evaluate on the training set using 5-fold cross-validation
cross_val_score(clf, X_train, y_train, cv=5, scoring="recall")

array([1.        , 0.96296296, 0.9245283 , 0.98113208, 1.        ])

Now, let's use the LOO strategy.

In [57]:
from sklearn.model_selection import LeaveOneOut, KFold

In [58]:
X_train.shape

(426, 30)

In [59]:
# LOO takes too long here because it has to fit 426 models (one per training sample). Let's reduce the dataset to 20 samples to illustrate this example
X_train_red = X_train[:20] # first 20 rows
y_train_red = y_train[:20]

In [60]:
perf = cross_val_score(clf, X_train_red, y_train_red, cv=LeaveOneOut(), scoring="accuracy")

perf

array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 0., 1., 1.,
       1., 1., 1.])
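Cross-validation scores can also guide hyperparameter tuning without touching the test set. The sketch below is only an illustration: the `DecisionTreeClassifier` and the list of candidate depths are assumptions, chosen because the tree depth was mentioned earlier as a typical hyperparameter.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

depths = [1, 2, 3, 5, 8]  # hypothetical candidate values for max_depth
mean_scores = {}
for depth in depths:
    model = DecisionTreeClassifier(max_depth=depth, random_state=0)
    # 5-fold CV on the training set only; the test set stays unseen
    scores = cross_val_score(model, X_train, y_train, cv=5)
    mean_scores[depth] = scores.mean()

best_depth = max(mean_scores, key=mean_scores.get)
print(mean_scores)
print("best max_depth:", best_depth)
```

The test set is used only once, after the hyperparameter has been chosen.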

### Let's rock with the `cross_validate` function.

The `cross_validate` function is similar to the `cross_val_score` function, but it allows us to evaluate several metrics at once and to return the fit times, score times, and (optionally) the training scores.

```python
from sklearn.model_selection import cross_validate

scores = cross_validate(model, X, y, cv=5, scoring=['precision', 'recall'], return_train_score=True)
```

In [61]:
from sklearn.model_selection import cross_validate

In [62]:
results = cross_validate(
    clf,
    X_train, y_train,
    cv=5,
    scoring=['precision', 'recall'],
    return_train_score=True
)

results

{'fit_time': array([0.63604712, 0.17565298, 0.15486693, 0.25058579, 0.26907396]),
 'score_time': array([0.00189495, 0.00157499, 0.00132799, 0.00137329, 0.0013659 ]),
 'test_precision': array([0.98181818, 0.94545455, 0.90740741, 0.98113208, 0.94642857]),
 'train_precision': array([0.96330275, 0.96744186, 0.97695853, 0.96759259, 0.98122066]),
 'test_recall': array([1.        , 0.96296296, 0.9245283 , 0.98113208, 1.        ]),
 'train_recall': array([0.98591549, 0.97652582, 0.99065421, 0.97663551, 0.97663551])}

We can transform the dictionary into a dataframe and calculate the mean and standard deviation of the scores to evaluate the general performance of the model. The ideal result is:
- *Train and test scores are **very similar***
- *The average of the scores is **high***
- *The standard deviation of the scores is **low***

In [67]:
from sklearn.metrics import confusion_matrix, recall_score, precision_score, accuracy_score

clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)

# scikit-learn metrics expect the arguments in the order (y_true, y_pred)
print(confusion_matrix(y_test, y_pred))
print(accuracy_score(y_test, y_pred))
print(precision_score(y_test, y_pred))
print(recall_score(y_test, y_pred))

[[52  1]
 [ 5 85]]
0.958041958041958
0.9883720930232558
0.9444444444444444


In [63]:
results = pd.DataFrame(results)
results

|  | fit_time | score_time | test_precision | train_precision | test_recall | train_recall |
|---|---|---|---|---|---|---|
| 0 | 0.636047 | 0.001895 | 0.981818 | 0.963303 | 1.0 | 0.985915 |
| 1 | 0.175653 | 0.001575 | 0.945455 | 0.967442 | 0.962963 | 0.976526 |
| 2 | 0.154867 | 0.001328 | 0.907407 | 0.976959 | 0.924528 | 0.990654 |
| 3 | 0.250586 | 0.001373 | 0.981132 | 0.967593 | 0.981132 | 0.976636 |
| 4 | 0.269074 | 0.001366 | 0.946429 | 0.981221 | 1.0 | 0.976636 |


In [42]:
results.mean()

fit_time           0.263650
score_time         0.001391
test_precision     0.952448
train_precision    0.971303
test_recall        0.973725
train_recall       0.981273
dtype: float64

In [43]:
results.std()

fit_time           0.166213
score_time         0.000176
test_precision     0.030819
train_precision    0.007467
test_recall        0.031511
train_recall       0.006616
dtype: float64

In [69]:
y.mean()

0.6274165202108963

## Other cross-validation strategies

### **Stratified K-Fold**

Plain K-Fold CV is not always the best option. For example, if the dataset has a strong class imbalance, some folds may contain very few samples of the minority class (or none at all), so the scores on those folds say little about how well the model handles that class. To solve this problem, we can use **Stratified K-Fold CV**. This strategy keeps the proportion of each class approximately the same in every fold.

```python
from sklearn.model_selection import StratifiedKFold

skf = StratifiedKFold(n_splits=5)
```
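To see the difference, this sketch counts the minority-class samples that land in each validation fold, with and without stratification. The labels are made up (90 negatives followed by 10 positives) and the helper `positives_per_fold` is hypothetical.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Toy imbalanced labels: 90 negatives followed by 10 positives
y = np.array([0] * 90 + [1] * 10)
X = np.zeros((100, 1))  # features are irrelevant for the split itself

def positives_per_fold(splitter):
    """Count the positive samples in each validation fold."""
    return [int(y[val_idx].sum()) for _, val_idx in splitter.split(X, y)]

plain = positives_per_fold(KFold(n_splits=5))
stratified = positives_per_fold(StratifiedKFold(n_splits=5))

print(plain)       # [0, 0, 0, 0, 10]
print(stratified)  # [2, 2, 2, 2, 2]
```

Without shuffling, plain `KFold` puts all positives in the last fold here, while `StratifiedKFold` spreads them evenly.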

### **Group K-Fold**

Group K-Fold CV is used when the samples in the dataset belong to groups. For example, if the dataset contains several measurements per patient and we want to predict disease progression, Group K-Fold ensures that all the data from a given patient falls either in the training set or in the validation set, never in both. Otherwise, the model would be validated on patients it has already seen, and the scores would be too optimistic.

```python
from sklearn.model_selection import GroupKFold

gkf = GroupKFold(n_splits=5)
```
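The splitter needs to know which group each sample belongs to, which is passed through the `groups` argument of `split`. A minimal sketch with made-up data, assuming 4 hypothetical patients with 3 measurements each:

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# 12 toy samples from 4 hypothetical patients (3 measurements each)
X = np.arange(24).reshape(12, 2)
y = np.arange(12) % 2
groups = np.repeat(["p1", "p2", "p3", "p4"], 3)

gkf = GroupKFold(n_splits=4)
overlaps = []
for train_idx, val_idx in gkf.split(X, y, groups=groups):
    train_patients = set(groups[train_idx])
    val_patients = set(groups[val_idx])
    # Patients present on both sides of the split (expected to be empty)
    overlaps.append(train_patients & val_patients)
    print(sorted(val_patients), "held out; trained on", sorted(train_patients))
```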

### **Time Series Split**

The Time Series Split CV is used when we have a time series dataset. For example, if we have a dataset with the stock price of a company, we can use the Time Series Split CV to ensure that the model is trained with data from the past and validated with data from the future.

```python
from sklearn.model_selection import TimeSeriesSplit

tscv = TimeSeriesSplit(n_splits=5)
```
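Printing the indices makes the "past for training, future for validation" behaviour visible. This is a small sketch on 12 made-up observations assumed to be sorted by time.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(12).reshape(12, 1)  # 12 toy observations in time order

tscv = TimeSeriesSplit(n_splits=3)
splits = list(tscv.split(X))

for train_idx, val_idx in splits:
    print("train:", train_idx, "validation:", val_idx)
```

In every split the training indices come before the validation indices, and the training window grows from one split to the next.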

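As an illustration of class imbalance (the situation Stratified K-Fold addresses), consider a dataset with 100 rows: 97 % positive and 3 % negative. A fold drawn without stratification can easily contain no negative samples at all.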

## What's next?

Once we have evaluated the model using cross-validation, we need to **train** the final model with the **entire training set** and **evaluate** it with the **test set**. This is the final step of the ML process.

Let's do it!

## Live practice

Given the `diabetes` dataset, we will evaluate regression models using cross-validation and then train the final model with the entire training set and evaluate it with the test set.

We can load the dataset using the `load_diabetes` function from the `sklearn.datasets` module.
* [load_diabetes](https://scikit-learn.org/stable/datasets/toy_dataset.html#diabetes-dataset)

```python
import pandas as pd
from sklearn.datasets import load_diabetes

data = load_diabetes()
x, y = data.data, data.target

# Wrap the feature matrix in a DataFrame using the feature names
x = pd.DataFrame(x, columns=data.feature_names)
```

In [70]:
from sklearn.datasets import load_diabetes

data = load_diabetes()

x, y = data.data, data.target

x = pd.DataFrame(x, columns=data.feature_names)

In [71]:
x.head()

|  | age | sex | bmi | bp | s1 | s2 | s3 | s4 | s5 | s6 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.038076 | 0.05068 | 0.061696 | 0.021872 | -0.044223 | -0.034821 | -0.043401 | -0.002592 | 0.019907 | -0.017646 |
| 1 | -0.001882 | -0.044642 | -0.051474 | -0.026328 | -0.008449 | -0.019163 | 0.074412 | -0.039493 | -0.068332 | -0.092204 |
| 2 | 0.085299 | 0.05068 | 0.044451 | -0.00567 | -0.045599 | -0.034194 | -0.032356 | -0.002592 | 0.002861 | -0.02593 |
| 3 | -0.089063 | -0.044642 | -0.011595 | -0.036656 | 0.012191 | 0.024991 | -0.036038 | 0.034309 | 0.022688 | -0.009362 |
| 4 | 0.005383 | -0.044642 | -0.036385 | 0.021872 | 0.003935 | 0.015596 | 0.008142 | -0.002592 | -0.031988 | -0.046641 |


In [72]:
px.histogram(y)

In [76]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=123)

print(x_train.shape)
print(y_test.shape)

print(y_train.shape)
print(x_test.shape)

(353, 10)
(89,)
(353,)
(89, 10)


In [77]:
from sklearn.linear_model import LinearRegression

reg = LinearRegression()

In [79]:
from sklearn.metrics import r2_score, mean_squared_error

results = cross_validate(
    reg,
    x_train, y_train,
    cv=5,
    scoring=['neg_mean_squared_error', 'r2'],
    return_train_score=True
)

results

{'fit_time': array([0.01300907, 0.00192595, 0.00242901, 0.00257397, 0.00415492]),
 'score_time': array([0.00217986, 0.00215912, 0.00086498, 0.0028832 , 0.00145698]),
 'test_neg_mean_squared_error': array([-3234.43535508, -3493.97275833, -2653.95772145, -3241.22843086,
        -2881.14535159]),
 'train_neg_mean_squared_error': array([-2838.13825005, -2811.8653821 , -3004.68591542, -2845.35144042,
        -2940.94343485]),
 'test_r2': array([0.36597401, 0.45581359, 0.51771404, 0.53489078, 0.4165442 ]),
 'train_r2': array([0.52857357, 0.50386375, 0.49241009, 0.48312651, 0.51412021])}

In [80]:
pd.DataFrame(results)

|  | fit_time | score_time | test_neg_mean_squared_error | train_neg_mean_squared_error | test_r2 | train_r2 |
|---|---|---|---|---|---|---|
| 0 | 0.013009 | 0.00218 | -3234.435355 | -2838.13825 | 0.365974 | 0.528574 |
| 1 | 0.001926 | 0.002159 | -3493.972758 | -2811.865382 | 0.455814 | 0.503864 |
| 2 | 0.002429 | 0.000865 | -2653.957721 | -3004.685915 | 0.517714 | 0.49241 |
| 3 | 0.002574 | 0.002883 | -3241.228431 | -2845.35144 | 0.534891 | 0.483127 |
| 4 | 0.004155 | 0.001457 | -2881.145352 | -2940.943435 | 0.416544 | 0.51412 |


In [83]:
from sklearn.ensemble import RandomForestRegressor

reg = RandomForestRegressor(n_estimators=50)

results = cross_validate(
    reg,
    x_train, y_train,
    cv=5,
    scoring=['neg_mean_squared_error', 'r2'],
    return_train_score=True
)

pd.DataFrame(results)

|  | fit_time | score_time | test_neg_mean_squared_error | train_neg_mean_squared_error | test_r2 | train_r2 |
|---|---|---|---|---|---|---|
| 0 | 0.078751 | 0.002218 | -3422.68387 | -512.467496 | 0.329073 | 0.914877 |
| 1 | 0.062451 | 0.002136 | -3508.935628 | -489.941157 | 0.453483 | 0.913553 |
| 2 | 0.057919 | 0.002133 | -3237.117093 | -493.703675 | 0.41174 | 0.916597 |
| 3 | 0.062444 | 0.002094 | -3570.617571 | -490.464857 | 0.487624 | 0.910904 |
| 4 | 0.059406 | 0.002454 | -3027.1904 | -551.659893 | 0.386969 | 0.908859 |


In [84]:
reg = RandomForestRegressor(n_estimators=200)

results = cross_validate(
    reg,
    x_train, y_train,
    cv=5,
    scoring=['neg_mean_squared_error', 'r2'],
    return_train_score=True
)

pd.DataFrame(results)

|  | fit_time | score_time | test_neg_mean_squared_error | train_neg_mean_squared_error | test_r2 | train_r2 |
|---|---|---|---|---|---|---|
| 0 | 0.252366 | 0.007157 | -3414.903215 | -471.8392 | 0.330598 | 0.921626 |
| 1 | 0.229239 | 0.006548 | -3473.861613 | -474.345024 | 0.458946 | 0.916305 |
| 2 | 0.244259 | 0.00855 | -2967.782319 | -486.699072 | 0.460685 | 0.917781 |
| 3 | 0.240476 | 0.006142 | -3585.35673 | -469.85552 | 0.485509 | 0.914648 |
| 4 | 0.245439 | 0.008453 | -3050.057784 | -487.883593 | 0.382338 | 0.919396 |
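
The last step of the practice, training a final model on the entire training set and evaluating it on the test set, could look like the sketch below. Picking `LinearRegression` as the final model is an assumption made here, based on its cross-validated train and test scores being much closer to each other than those of the random forests; the split parameters are the ones used in the cells above.

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

x, y = load_diabetes(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=123
)

# Fit on the whole training set (no folds held out this time)
final_model = LinearRegression()
final_model.fit(x_train, y_train)

# Evaluate once on the test set, which was never used during cross-validation
y_pred = final_model.predict(x_test)
test_r2 = r2_score(y_test, y_pred)
test_mse = mean_squared_error(y_test, y_pred)
print(test_r2, test_mse)
```

If the test scores fall within the range seen across the cross-validation folds, that is a good sign that the CV estimate was reliable.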
