# Cross-Validation
**Cross-Validation**: a technique used to evaluate the generalization performance of a model. This is done by iteratively splitting the original training dataset into smaller training and testing subsets; for each split, a model is fit on the training subset and scored on the testing subset.

The purpose is not to train a final model, but to estimate how well a model generalizes, which makes overfitting and underfitting visible. A model scored on the same data it was trained on can look accurate while overfitting; cross-validation avoids this by always scoring on a validation subset the model did not see during fitting. A single small validation subset, in turn, gives a noisy estimate, so the evaluation is repeated over several different training-validation splits and the scores are compared or averaged.

Note: a final test set should always be held out from cross-validation and used only for the final evaluation.

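The procedure described above can be sketched as a manual loop. This is only an illustration, not a recommended implementation: the iris dataset, `LogisticRegression(max_iter=1000)`, `random_state=0`, and the 80/20 hold-out split are assumptions chosen for the example.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, train_test_split

X, y = load_iris(return_X_y=True)

# Hold out a final test set that cross-validation never sees.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Cross-validate on the training portion only.
fold_scores = []
kfold = KFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, val_idx in kfold.split(X_train):
    # A fresh model is fit for every split.
    model = LogisticRegression(max_iter=1000)
    model.fit(X_train[train_idx], y_train[train_idx])
    fold_scores.append(model.score(X_train[val_idx], y_train[val_idx]))

print(f"Validation accuracy per fold: {np.round(fold_scores, 3)}")
print(f"Mean validation accuracy: {np.mean(fold_scores):.3f}")

# Only after model selection is done: fit on all training data, score once.
final_model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
test_accuracy = final_model.score(X_test, y_test)
print(f"Held-out test accuracy: {test_accuracy:.3f}")
```

The models fit inside the loop are discarded; only their scores are kept, which is why cross-validation evaluates a modelling choice rather than producing a trained model.
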
### Cross-Validation Iterators

**KFold**: divides all samples from the original dataset into `k` groups (called folds) of equal size, or as close to equal as possible. Each iteration trains on `k - 1` folds and tests on the remaining fold; thus, there are `k` iterations. This assumes the data is independent and identically distributed (iid).

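The iterator only yields index arrays, so the data itself is never reordered. By default (`shuffle=False`), the folds are consecutive blocks of samples. When `shuffle=True`, the sample indices are shuffled before being assigned to folds, so each test set is a random subset, but every sample still appears in exactly one test set (there is no overlap of test sets between splits). Samples within each split are not shuffled: the returned indices are sorted.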

In [18]:
from sklearn.model_selection import KFold

X = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
kfold = KFold(n_splits=4, shuffle=True)
for train, test in kfold.split(X):
    print(f"Train: {train}, Test: {test}\n")

Train: [ 0  1  2  3  4  6  7  9 10], Test: [ 5  8 11]

Train: [ 0  3  4  5  7  8  9 10 11], Test: [1 2 6]

Train: [ 1  2  3  5  6  7  8 10 11], Test: [0 4 9]

Train: [ 0  1  2  4  5  6  8  9 11], Test: [ 3  7 10]

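For comparison, a small sketch of the default behaviour without shuffling, using the same twelve samples as above. The variable names are illustrative only.

```python
from sklearn.model_selection import KFold

X = list(range(12))
kfold = KFold(n_splits=4)  # shuffle=False is the default

# Collect the test indices of every split as plain lists.
test_folds = [test.tolist() for _, test in kfold.split(X)]
for fold in test_folds:
    print(f"Test: {fold}")
# Test: [0, 1, 2]
# Test: [3, 4, 5]
# Test: [6, 7, 8]
# Test: [9, 10, 11]

# Each sample lands in exactly one test fold.
all_test_indices = sorted(i for fold in test_folds for i in fold)
print(all_test_indices == X)  # True
```

Consecutive folds like these are a poor choice when the rows are ordered (for example, sorted by class), which is one reason to pass `shuffle=True`.
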


**ShuffleSplit**: generates a user-defined number of independent train-test splits. For each split, the samples are shuffled and then divided into a training set and a testing set, so testing sets from different splits can overlap.

In [19]:
from sklearn.model_selection import ShuffleSplit

X = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
shuffle = ShuffleSplit(n_splits=3, test_size=0.25, random_state=0)
for train, test in shuffle.split(X):
    print(f"Train: {train}, Test: {test}\n")

Train: [10  2  8  1  7  9  3  0  5], Test: [ 6 11  4]

Train: [ 4  9  0 11  7  6  1 10  8], Test: [5 2 3]

Train: [ 2  7  5 11  0  3  4  9  8], Test: [ 6  1 10]



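**LeaveOneOut**: each training set is created by taking all samples except for one, which is left out as the testing sample; for `n` samples there are `n` splits. Because any two training sets differ by only one sample, the models are virtually identical between splits, which tends to give a high-variance estimate at a high computational cost. For this reason 5- or 10-fold cross-validation is generally recommended instead. LeaveOneOut can still be beneficial when there is only a small amount of data, since almost every sample is used for training.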

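A minimal sketch of the splits LeaveOneOut produces, on a hypothetical four-sample dataset:

```python
from sklearn.model_selection import LeaveOneOut

X = [0, 1, 2, 3]
loo = LeaveOneOut()

# One split per sample: n samples -> n splits.
splits = [(train.tolist(), test.tolist()) for train, test in loo.split(X)]
for train, test in splits:
    print(f"Train: {train}, Test: {test}")
# Train: [1, 2, 3], Test: [0]
# Train: [0, 2, 3], Test: [1]
# Train: [0, 1, 3], Test: [2]
# Train: [0, 1, 2], Test: [3]
```
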
### Cross-Validation Stratification
**Stratification**: a method to ensure that each split has approximately the same percentage of samples of each target class as the complete dataset. Without stratification, splitting data into folds can leave some models in the cross-validation process exposed to more or less of a certain target class than others. This is particularly helpful for class-imbalanced data.

Options include:
- `StratifiedKFold`
- `StratifiedShuffleSplit`

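The effect of stratification can be seen by counting the classes in each test fold. The sketch below uses a made-up, imbalanced label vector (nine samples of class 0, three of class 1) sorted by class; the feature matrix is a placeholder because the splitters only look at `y`.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

X = np.zeros((12, 1))            # placeholder features
y = np.array([0] * 9 + [1] * 3)  # 75% class 0, 25% class 1

splitters = {
    "KFold": KFold(n_splits=3),
    "StratifiedKFold": StratifiedKFold(n_splits=3),
}

# For every splitter, count how many samples of each class land in each test fold.
counts = {}
for name, cv in splitters.items():
    counts[name] = [
        np.bincount(y[test], minlength=2).tolist() for _, test in cv.split(X, y)
    ]
    print(f"{name}: {counts[name]}")
# KFold: [[4, 0], [4, 0], [1, 3]]
# StratifiedKFold: [[3, 1], [3, 1], [3, 1]]
```

With plain `KFold`, two of the three test folds contain no class-1 samples at all, while `StratifiedKFold` keeps the 3:1 ratio in every fold.
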
### Cross-Validation Grouping
**Grouping**: needed when samples are not independent but come in groups, and a model is flexible enough to learn features specific to a particular group. For example, in a medical dataset where multiple samples can belong to the same patient (group), a model can overfit to individual patients, and its score is inflated if the same patient also appears in the testing set. Using the grouping version of an iterator ensures the same group is not represented in *both* the testing and training sets for a given split.

Options include:
- `GroupKFold`
- `StratifiedGroupKFold` if class proportions must be balanced
- `LeaveOneGroupOut`
- `LeavePGroupsOut`
- `GroupShuffleSplit`

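As a rough illustration of the patient example, the sketch below uses `GroupKFold` on a made-up dataset of eight samples from four hypothetical patients (`p1` to `p4`) and checks that no patient appears on both sides of a split.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

X = np.arange(8).reshape(-1, 1)          # placeholder features
y = np.array([0, 1, 0, 1, 0, 1, 0, 1])   # placeholder targets
# Hypothetical patient IDs: two samples per patient.
groups = np.array(["p1", "p1", "p2", "p2", "p3", "p3", "p4", "p4"])

gkf = GroupKFold(n_splits=2)
group_splits = []
for train, test in gkf.split(X, y, groups=groups):
    train_groups = set(groups[train].tolist())
    test_groups = set(groups[test].tolist())
    group_splits.append((train_groups, test_groups))
    print(f"Train groups: {sorted(train_groups)}, Test groups: {sorted(test_groups)}")

# No patient is ever in both the training and the testing set of a split.
no_leakage = all(tr.isdisjoint(te) for tr, te in group_splits)
print(f"No group leakage: {no_leakage}")
```

The `groups` array is passed to `split` (or to `cross_val_score` / `cross_validate` via their `groups` parameter) alongside `X` and `y`; it is not a feature of the model.
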
### Cross-Validation Metrics

`cross_val_score` returns one score for each fold of the cross-validation. By default, the score is computed with the estimator's `score` method; a different metric can be selected using the `scoring` parameter.

In [20]:
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score

data, target = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(data, target, test_size=0.2)
scores = cross_val_score(
    LogisticRegression(),
    X_train,
    y_train,
    cv=5,
    scoring="accuracy"
)

print(f"Scores: {scores}")

Scores: [0.91666667 1.         0.95833333 1.         1.        ]


`cross_validate` enables the use of multiple evaluation metrics and returns a dictionary with detailed information for each fold, such as fit times, score times, and test scores.

In [22]:
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_validate

data, target = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(data, target, test_size=0.2)
cv_results = cross_validate(
    LogisticRegression(),
    X_train,
    y_train,
    cv=5,
    scoring="accuracy"
)

# Single quotes inside the f-strings keep this valid on Python versions before 3.12.
print(f"Fit times: {cv_results['fit_time']}")
print(f"Score times: {cv_results['score_time']}")
print(f"Test scores: {cv_results['test_score']}")

Fit times: [0.00675011 0.00489306 0.00547194 0.00482082 0.00519609]
Score times: [0.00051475 0.00044799 0.00034928 0.0003171  0.00029016]
Test scores: [0.91666667 0.95833333 1.         0.95833333 0.95833333]

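To illustrate the multiple-metric case, the following sketch passes a list of scorer names to `scoring` and also requests training scores. The choice of `"f1_macro"` as the second metric, `max_iter=1000`, and `random_state=0` are assumptions made for the example.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_validate

data, target = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    data, target, test_size=0.2, random_state=0
)

multi_results = cross_validate(
    LogisticRegression(max_iter=1000),
    X_train,
    y_train,
    cv=5,
    scoring=["accuracy", "f1_macro"],  # one "test_<name>" entry per metric
    return_train_score=True,           # also adds "train_<name>" entries
)

for key in sorted(multi_results):
    print(f"{key}: {multi_results[key]}")
```

With several metrics, the single `test_score` key is replaced by one key per metric (`test_accuracy`, `test_f1_macro`). Comparing the `train_*` and `test_*` entries is a quick way to spot overfitting.
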