# Cross-Validation
<hr style="border:2px solid black">

## 1. Overfitting & Underfitting

>- ML models should make sensible predictions with unseen data
>- over- & underfitting: most common causes of poor model performance

<img src="fitting.png" width="700"/>

**Q: What happens during fitting?**

>- model learns from data and adjusts its own parameters 
>- model tries to improve its understanding of data

**Underfit Model**

>- model does not fit training data well enough
>- model is too simple: high bias, low variance
>- poor training accuracy, and poor test accuracy

**Overfit Model**  

>- model fits training data too well
>- model is too complex: low bias, high variance
>- good training accuracy, but poor test accuracy

**How to avoid overfitting?**

>- feature selection 
>- less complex model with fewer parameters
>- regularization
>- cross-validation
>- data augmentation: bootstrap, synthetic oversampling

***A good model generalizes well: neither underfit nor overfit.***
<br>
***Bias-variance trade-off: desirable performance in between over- & underfitting***
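
The contrast between underfit and overfit models can be sketched on a toy problem. The example below is only an illustration under stated assumptions: the data are a synthetic noisy sine curve (not the penguin data), and the polynomial degrees 1, 4 and 15 are arbitrary choices for "too simple", "about right" and "too complex".

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# synthetic data (assumption): noisy sine curve on [0, 1]
rng = np.random.default_rng(42)
x = rng.uniform(0, 1, size=60).reshape(-1, 1)
y = np.sin(2 * np.pi * x).ravel() + rng.normal(scale=0.3, size=60)

x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.5, random_state=42
)

fit_scores = {}
for degree in (1, 4, 15):
    # polynomial regression of the given degree
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(x_train, y_train)
    fit_scores[degree] = (
        model.score(x_train, y_train),  # R² on data seen during fitting
        model.score(x_test, y_test),    # R² on unseen data
    )
    print(f"degree {degree:2d}: "
          f"train R² = {fit_scores[degree][0]:.3f}, "
          f"test R² = {fit_scores[degree][1]:.3f}")
```

The training score can only improve as the degree grows, because each model contains the simpler ones as special cases; the test score is the one that reveals whether the extra flexibility helps on unseen data.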

<hr style="border:2px solid black">

## 2. Example: Penguin Dataset

**Load packages**

In [1]:
# data analysis stack
import pandas as pd
import numpy as np

# data visualization stack
from matplotlib import pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set() # set seaborn as default style

# data pre-processing stack
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline, make_pipeline

# machine learning stack
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

**Load data**

In [2]:
df = sns.load_dataset("penguins")
df.dropna(inplace=True)
df.head()

| | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex |
|---|---|---|---|---|---|---|---|
| 0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | Male |
| 1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | Female |
| 2 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | Female |
| 4 | Adelie | Torgersen | 36.7 | 19.3 | 193.0 | 3450.0 | Female |
| 5 | Adelie | Torgersen | 39.3 | 20.6 | 190.0 | 3650.0 | Male |


In [3]:
df.info() # quick data exploration

<class 'pandas.core.frame.DataFrame'>
Int64Index: 333 entries, 0 to 343
Data columns (total 7 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   species            333 non-null    object 
 1   island             333 non-null    object 
 2   bill_length_mm     333 non-null    float64
 3   bill_depth_mm      333 non-null    float64
 4   flipper_length_mm  333 non-null    float64
 5   body_mass_g        333 non-null    float64
 6   sex                333 non-null    object 
dtypes: float64(4), object(3)
memory usage: 20.8+ KB


**Features and target variable**

In [4]:
num_features = ['bill_length_mm',
                'bill_depth_mm',
                'flipper_length_mm'
               ]

cat_features = ['species',
                'island',
                'sex'
               ]

features = num_features + cat_features

target_variable = 'body_mass_g'

In [5]:
features

['bill_length_mm',
 'bill_depth_mm',
 'flipper_length_mm',
 'species',
 'island',
 'sex']

In [6]:
# feature and target columns
X,y = df[features],df[target_variable]

**Train-test split**

In [7]:
X_train,X_test,y_train,y_test = train_test_split(
    X,y,test_size=0.2,random_state=42
)
X_train.shape, X_test.shape

((266, 6), (67, 6))

**Feature engineering**

In [8]:
# column transformation
transformer = ColumnTransformer(
    transformers=[
        ('scaling', MinMaxScaler(), num_features),
        ('onehot', OneHotEncoder(drop='first'), cat_features)
    ])
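
To see what such a column transformation produces, here is a minimal sketch on a three-row toy frame. The column names `length`, `depth` and `island` and their values are made up for illustration, and `sparse_threshold=0` is set only so that the result is always a dense array that is easy to print.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# toy data (assumption): two numerical columns, one categorical column
toy = pd.DataFrame({
    "length": [10.0, 20.0, 30.0],
    "depth": [1.0, 3.0, 2.0],
    "island": ["A", "B", "C"],
})

toy_transformer = ColumnTransformer(
    transformers=[
        ("scaling", MinMaxScaler(), ["length", "depth"]),      # rescale to [0, 1]
        ("onehot", OneHotEncoder(drop="first"), ["island"]),   # 3 categories -> 2 columns
    ],
    sparse_threshold=0,  # always return a dense array
)

toy_out = toy_transformer.fit_transform(toy)
print(toy_out.shape)  # 2 scaled columns + 2 one-hot columns
print(toy_out)
```

With `drop='first'`, the first category of each categorical feature becomes the all-zeros baseline, which avoids perfectly collinear columns in a linear model.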

**Model building**

In [9]:
# pipeline
pipeline = Pipeline(
    steps=[
        ('transformer', transformer),  # column transformation
        ('lr_model', LinearRegression())  # linear fit
    ])

In [10]:
# model training
pipeline.fit(X_train,y_train);

**Model Performance**

In [11]:
# training score
train_score = pipeline.score(X_train,y_train)

# test score
test_score = pipeline.score(X_test,y_test)

print(f'train score: {round(train_score,6)}')
print(f'test score : {round(test_score,6)}')

train score: 0.869595
test score : 0.896169


- ***Comparison of the performances on the train and the test data tells us how good our model is at generalizing***
- ***But the results depend on how the data has been (randomly) split***
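
How strongly a single score depends on the split can be checked by repeating the split with different seeds. The sketch below assumes synthetic regression data from `make_regression` as a stand-in for the penguin data; the sample size, noise level and number of repeats are arbitrary.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# synthetic stand-in data (assumption), not the penguin dataset
X_demo, y_demo = make_regression(
    n_samples=300, n_features=5, noise=30.0, random_state=0
)

split_scores = []
for seed in range(20):
    # same data, same model, different random split
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_demo, y_demo, test_size=0.2, random_state=seed
    )
    split_scores.append(LinearRegression().fit(X_tr, y_tr).score(X_te, y_te))

split_scores = np.array(split_scores)
print(f"test R² over 20 splits: min = {split_scores.min():.3f}, "
      f"max = {split_scores.max():.3f}, std = {split_scores.std():.3f}")
```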

<hr style="border:2px solid black">

## 3. Cross-Validation

***Q: How can we reliably measure the generalizability of a model without using test data?***

### 3.1 Validation Set Approach

<img src="validation_set_approach.png" width="700"/>

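**Cons:**
>- sampling bias: the score depends strongly on which samples end up in the validation set
>- underfitting: fewer samples remain for training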
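
A minimal sketch of the validation set approach, assuming synthetic data and a 60/20/20 split (both are illustrative choices): the test set is put aside first, and the remainder is split once more into a training and a validation set.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# synthetic stand-in data (assumption)
X_all, y_all = make_regression(
    n_samples=300, n_features=5, noise=30.0, random_state=0
)

# 1) put the test set aside (20% of all samples)
X_rest, X_hold, y_rest, y_hold = train_test_split(
    X_all, y_all, test_size=0.2, random_state=42
)
# 2) split the remainder into training and validation sets
#    (25% of the remaining 80% = 20% of all samples)
X_fit, X_val, y_fit, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=42
)

model = LinearRegression().fit(X_fit, y_fit)
val_r2 = model.score(X_val, y_val)  # the test set stays untouched
print(len(X_fit), len(X_val), len(X_hold))
print(f"validation R²: {val_r2:.3f}")
```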

### 3.2 Leave-One-Out Cross-Validation
<br>

<img src="loocv.png" width="700"/>

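**Cons:**
>- computationally expensive: one model fit per sample
>- overfitting (high variance): the training sets are nearly identical, so the averaged estimate can have high variance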
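
Leave-one-out can be sketched with scikit-learn's `LeaveOneOut` splitter. The example assumes a small synthetic dataset; it scores with the mean squared error because R² is not defined on a validation set that contains a single sample.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

# small synthetic dataset (assumption)
X_small, y_small = make_regression(
    n_samples=50, n_features=3, noise=10.0, random_state=0
)

loo_scores = cross_val_score(
    estimator=LinearRegression(),
    X=X_small,
    y=y_small,
    scoring="neg_mean_squared_error",  # R² needs more than one validation sample
    cv=LeaveOneOut(),                  # one split per sample
)

loo_mse = -loo_scores.mean()  # flip the sign back to a plain MSE
print(f"number of fits: {len(loo_scores)}")
print(f"LOOCV mean squared error: {loo_mse:.2f}")
```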

### 3.3 *k*-Fold Cross-Validation

>- split data into *k* subsets or *folds*
>- reserve one fold as validation set and train on the remaining *k-1*
>- train and evaluate *k* separate models
>- allows evaluation of model robustness

<img src="crossval.png" width="700"/>
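
The *k*-fold procedure can be written out as an explicit loop. This is a sketch on synthetic data (an assumption) that also counts how often each sample is used for validation.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

# synthetic stand-in data (assumption)
X_kf, y_kf = make_regression(
    n_samples=100, n_features=4, noise=20.0, random_state=0
)

kfold = KFold(n_splits=5, shuffle=True, random_state=42)
fold_scores = []
val_counts = np.zeros(len(X_kf), dtype=int)

for train_idx, val_idx in kfold.split(X_kf):
    # train on k-1 folds, validate on the remaining fold
    model = LinearRegression().fit(X_kf[train_idx], y_kf[train_idx])
    fold_scores.append(model.score(X_kf[val_idx], y_kf[val_idx]))
    val_counts[val_idx] += 1

print(np.round(fold_scores, 3))
print(np.unique(val_counts))  # how often each sample served as validation data
```

Every sample lands in exactly one validation fold, so all of the data contributes to both training and evaluation.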

### 3.4 k-Fold CV Implementation

In [12]:
from sklearn.model_selection import (
    cross_validate,
    cross_val_score
)

- [`cross_validate`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_validate.html)

- [`cross_val_score`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html?highlight=cross_val_score#sklearn.model_selection.cross_val_score)

In [13]:
kf_cv = cross_validate(
    estimator = pipeline, # model to evaluate
    X = X_train,
    y = y_train,
    cv = 5,        # number of cross-validation splits
    scoring ='r2', # evaluation metric
    return_train_score=True
)

In [14]:
kf_cv

{'fit_time': array([0.0056994 , 0.00348139, 0.00538278, 0.00327325, 0.00351191]),
 'score_time': array([0.00222111, 0.00228643, 0.00370026, 0.00183415, 0.0021224 ]),
 'test_score': array([0.85652998, 0.86369923, 0.90838863, 0.82870876, 0.82735791]),
 'train_score': array([0.87088114, 0.87010661, 0.85831696, 0.87740042, 0.87684306])}

In [18]:
train_mean = round(kf_cv['train_score'].mean(), 6)
train_std = round(kf_cv['train_score'].std(), 6)
val_mean = round(kf_cv['test_score'].mean(), 6)
val_std = round(kf_cv['test_score'].std(), 6)

print(f"training score  : {train_mean} \u00B1 {train_std}")
print(f"validation score: {val_mean} \u00B1 {val_std}")

training score  : 0.87071 ± 0.006875
validation score: 0.856937 ± 0.029546


***Interpretation of results:***

>- mean training score >> mean validation score: overfitting
>- mean training score and mean validation score both low: underfitting
>- large std of validation scores: high sampling bias; need more data or a different model/hyperparameters
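
These rules of thumb can be applied to the fold scores reported above. `summarize_cv` is a hypothetical helper written for this sketch, not a scikit-learn function, and the fold scores are copied in by hand.

```python
import numpy as np

def summarize_cv(train_scores, val_scores):
    """Summarize fold scores (hypothetical helper for illustration)."""
    train_scores = np.asarray(train_scores)
    val_scores = np.asarray(val_scores)
    return {
        "train_mean": train_scores.mean(),
        "val_mean": val_scores.mean(),
        "val_std": val_scores.std(),
        # positive gap: the model scores better on data it was fitted on
        "gap": train_scores.mean() - val_scores.mean(),
    }

# fold scores copied from the cross_validate output above
summary = summarize_cv(
    train_scores=[0.87088114, 0.87010661, 0.85831696, 0.87740042, 0.87684306],
    val_scores=[0.85652998, 0.86369923, 0.90838863, 0.82870876, 0.82735791],
)

for name, value in summary.items():
    print(f"{name}: {value:.6f}")
```

Here the gap between the mean training and validation scores is smaller than the standard deviation of the validation scores, so these numbers give no strong sign of overfitting.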

**Cross-Validators**

In [19]:
from sklearn.model_selection import KFold

- [`KFold`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html?highlight=kfold#sklearn.model_selection.KFold): *k*-folds cross-validator, splits dataset into *k* consecutive folds
- [`StratifiedKFold`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedKFold.html#sklearn.model_selection.StratifiedKFold): returns stratified folds that preserve the percentage of samples for each class
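
Stratification applies to classification targets, so it is not used in the penguin regression example. A short sketch on a made-up imbalanced label vector (80 samples of class 0, 20 of class 1) shows the effect:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# made-up imbalanced classification labels (assumption)
X_cls = np.arange(100).reshape(-1, 1)
y_cls = np.array([0] * 80 + [1] * 20)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

minority_share = []
for train_idx, val_idx in skf.split(X_cls, y_cls):
    # fraction of class 1 in this validation fold
    minority_share.append(y_cls[val_idx].mean())

print(minority_share)
```

Each validation fold keeps the overall 20% share of class 1, which a plain `KFold` does not guarantee.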

In [20]:
k_fold = KFold(n_splits=5, shuffle=True, random_state=42)
# k_fold = KFold(n_splits=5)

In [21]:
scores_kf_cv = cross_val_score(
    estimator = pipeline, # model to evaluate
    X = X_train,
    y = y_train,
    scoring ='r2', # evaluation metric
    cv = k_fold    # cross-validation splitting
) 

In [24]:
scores_kf_cv

array([0.80008555, 0.84903344, 0.84646203, 0.89227158, 0.859357  ])

In [25]:
val_mean = round(scores_kf_cv.mean(), 6)
val_std = round(scores_kf_cv.std(), 6)

print(f"validation score: {val_mean} \u00B1 {val_std}")

validation score: 0.849442 ± 0.02959


**Shuffle-Split**

In [26]:
from sklearn.model_selection import ShuffleSplit

- [`ShuffleSplit`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.ShuffleSplit.html): random permutation cross-validator, useful if the data was not shuffled in the train-test split; does not guarantee entirely different folds
- [`StratifiedShuffleSplit`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedShuffleSplit.html) makes folds that preserve the percentage of samples for each class
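
That the splits are not guaranteed to be entirely different can be checked directly. The sketch below uses 20 dummy samples and an exaggerated `test_size=0.5` (both assumptions) and counts how often each sample is drawn into a validation set.

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit

# 20 dummy samples (assumption); only the indices matter here
X_idx = np.arange(20).reshape(-1, 1)
ss = ShuffleSplit(n_splits=10, test_size=0.5, random_state=42)

val_counts = np.zeros(20, dtype=int)
for train_idx, val_idx in ss.split(X_idx):
    val_counts[val_idx] += 1  # sample was used for validation in this split

print(val_counts)
```

Unlike in *k*-fold cross-validation, the counts differ from sample to sample: some samples are validated on many times, others rarely or possibly never.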

In [27]:
shuffle_split = ShuffleSplit(n_splits=10,test_size=0.2,random_state=42)

scores_ss = cross_val_score(
    estimator = pipeline, # the model to evaluate
    X = X_train,
    y = y_train,
    scoring ='r2',      # evaluation metric
    cv = shuffle_split, # shuffled cross-validation splitting
)

In [28]:
scores_ss

array([0.80008555, 0.86550429, 0.82963158, 0.88085133, 0.88886132,
       0.88588237, 0.85660824, 0.82773514, 0.89217586, 0.80681309])

In [29]:
val_mean = round(scores_ss.mean(), 6)
val_std = round(scores_ss.std(), 6)

print(f"validation score: {val_mean} \u00B1 {val_std}")

validation score: 0.853415 ± 0.03311


<hr style="border:2px solid black">

## 4. Bootstrapping

>- iteratively resampling a dataset with replacement (given sample size and number of repeats)
>- useful for estimating summary statistics such as mean or standard deviation

<img src="bootstrap.png" width="700"/>

In [30]:
from sklearn.utils import resample

- [`resample`](https://scikit-learn.org/stable/modules/generated/sklearn.utils.resample.html): resample arrays (one step of bootstrapping procedure)

**Example of sampling with replacement**

In [31]:
my_list = list(range(10))

for i in range(10):
    print(resample(my_list))

[1, 3, 4, 9, 3, 2, 6, 2, 3, 7]
[4, 8, 0, 7, 0, 2, 3, 7, 7, 0]
[7, 4, 7, 3, 1, 8, 8, 9, 8, 4]
[5, 5, 3, 6, 8, 9, 3, 8, 9, 0]
[3, 2, 0, 6, 6, 3, 6, 7, 5, 4]
[4, 4, 9, 3, 5, 3, 4, 5, 4, 9]
[8, 6, 2, 5, 3, 2, 2, 5, 5, 1]
[3, 7, 5, 0, 2, 6, 5, 6, 3, 1]
[6, 5, 1, 5, 8, 6, 5, 6, 8, 0]
[8, 9, 7, 4, 5, 3, 0, 1, 9, 1]
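
Repeating this resampling step many times gives a distribution of a summary statistic. As a sketch, assuming a made-up normal sample of 200 values, the bootstrap standard error of the mean can be compared with the textbook formula s/√n:

```python
import numpy as np
from sklearn.utils import resample

# made-up sample (assumption): 200 draws from a normal distribution
rng = np.random.default_rng(0)
sample = rng.normal(loc=50.0, scale=10.0, size=200)

# mean of each bootstrap resample (same size as the original sample)
boot_means = np.array([
    resample(sample, random_state=seed).mean() for seed in range(2000)
])

boot_se = boot_means.std(ddof=1)                         # bootstrap estimate
analytic_se = sample.std(ddof=1) / np.sqrt(len(sample))  # s / sqrt(n)

print(f"bootstrap SE of the mean: {boot_se:.3f}")
print(f"analytic SE of the mean : {analytic_se:.3f}")
```

The two estimates should agree closely; the bootstrap becomes useful for statistics such as a model score, where no simple formula is available.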


### Penguin Dataset: Summary Statistics

In [32]:
train_scores = []
val_scores = []

for i in range(1000):
    # resample original data to create a "new" dataset
    Xb, yb = resample(X_train,y_train)
    
    # split data into training and validation sets
    Xb_train, Xb_val, yb_train, yb_val = \
    train_test_split(Xb,yb,test_size=0.2)
    
    # fit the model and calculate train and validation scores
    model = pipeline.fit(Xb_train, yb_train)
    train_score = pipeline.score(Xb_train, yb_train)
    val_score = pipeline.score(Xb_val, yb_val)
    
    train_scores.append(train_score) 
    val_scores.append(val_score)
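
One caveat of splitting after resampling: drawing with replacement duplicates rows, so copies of the same row can end up in both the training and the validation set, which can make validation scores look better than they are. A possible alternative, sketched here on synthetic data (an assumption, and not the procedure used in this notebook), is to validate on the out-of-bag rows that were never drawn:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.utils import resample

# synthetic stand-in data (assumption)
X_bs, y_bs = make_regression(
    n_samples=200, n_features=5, noise=30.0, random_state=0
)
all_idx = np.arange(len(X_bs))

oob_scores = []
for seed in range(200):
    boot_idx = resample(all_idx, random_state=seed)  # in-bag rows, with duplicates
    oob_idx = np.setdiff1d(all_idx, boot_idx)        # rows never drawn
    model = LinearRegression().fit(X_bs[boot_idx], y_bs[boot_idx])
    oob_scores.append(model.score(X_bs[oob_idx], y_bs[oob_idx]))

oob_scores = np.array(oob_scores)
print(f"out-of-bag R²: {oob_scores.mean():.3f} \u00B1 {oob_scores.std():.3f}")
```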

**Training score**

In [33]:
bs_train = pd.Series(train_scores)
print(f'training score: {bs_train.mean():.6} \u00B1 {bs_train.std():.6}')

training score: 0.873622 ± 0.0141658


**Validation score**

In [None]:
bs_val = pd.Series(val_scores)
print(f'validation score: {bs_val.mean():.6} \u00B1 {bs_val.std():.6}')

In [None]:
plt.rcParams['figure.figsize'] = (8,3)
sns.histplot(bs_val);

**80% confidence interval**

In [None]:
bs_val.quantile(q=[0.1,0.9])

**90% confidence interval**

In [None]:
bs_val.quantile(q=[0.05,0.95])

**99% confidence interval**

In [None]:
bs_val.quantile(q=[0.005,0.995])
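
The three intervals follow one pattern: for a confidence level c, take the quantiles (1 - c)/2 and 1 - (1 - c)/2 of the bootstrap scores. `percentile_interval` below is a hypothetical helper that wraps this pattern, shown on made-up values instead of the bootstrap scores.

```python
import numpy as np

def percentile_interval(values, level):
    """Percentile interval at the given level (hypothetical helper)."""
    alpha = 1.0 - level
    lower, upper = np.quantile(values, [alpha / 2, 1 - alpha / 2])
    return lower, upper

# made-up values 0, 1, ..., 100 (assumption) to make the quantiles easy to read
demo_values = np.arange(101)

for level in (0.80, 0.90, 0.99):
    lower, upper = percentile_interval(demo_values, level)
    print(f"{level:.0%} interval: [{lower:.1f}, {upper:.1f}]")
```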

***In the real world, we want to compare multiple models + hyperparameters:*** `GridSearchCV`
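
A minimal `GridSearchCV` sketch, under stated assumptions: the data are synthetic, and `Ridge` is swapped in for `LinearRegression` because it has a hyperparameter (`alpha`) to tune. The grid values are arbitrary.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# synthetic stand-in data (assumption)
X_gs, y_gs = make_regression(
    n_samples=200, n_features=10, noise=30.0, random_state=0
)

search_pipeline = Pipeline(
    steps=[
        ("scaling", MinMaxScaler()),
        ("ridge", Ridge()),  # swapped in for LinearRegression (assumption)
    ])

# hyperparameters are addressed as <step name>__<parameter name>
param_grid = {"ridge__alpha": [0.01, 0.1, 1.0, 10.0]}

grid = GridSearchCV(
    estimator=search_pipeline,
    param_grid=param_grid,
    cv=5,          # 5-fold cross-validation for every candidate
    scoring="r2",
)
grid.fit(X_gs, y_gs)

print(grid.best_params_)
print(f"best mean validation R²: {grid.best_score_:.3f}")
```

Each candidate in the grid is scored by cross-validation, so the comparison between hyperparameter values does not touch the test set.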

<hr style="border:2px solid black">

## References

- [Machine Learning Fundamentals: Cross Validation](https://www.youtube.com/watch?v=fSytzGwwBVw)
- [A Gentle Introduction to the Bootstrap Method](https://machinelearningmastery.com/a-gentle-introduction-to-the-bootstrap-method/)