# Cross-Validation
**Cross-Validation**: a technique used to evaluate the generalization performance of a model (important: the purpose is not to train a model).

This is done by repeatedly splitting the training set into a smaller training subset and a validation subset. In each iteration, a new model is fit on the training subset and evaluated on that iteration's internally created validation subset.

Cross-validation helps detect *overfitting* by following the classic approach of evaluating on a training/validation split before testing on the test set. It also reduces the risk of *underfitting*, which can occur because splitting the training set into smaller training and validation sets leaves fewer samples to train on. It does this by running multiple iterations over different splits of a single dataset.

**There should always be a separate, final test set that is used for final evaluation.**

### Cross-Validation Strategies

`KFold`: divides all samples in the dataset into `k` groups of samples (called folds), of equal sizes (if possible). Then, each iteration will train on `k-1` folds and test on `1` fold. Thus, there will be `k` iterations. *This assumes that data is independently and identically distributed.*

Here is an example of `KFold(n_splits=3, shuffle=True)`:

```
Data: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11

Fold 1: 7, 2, 9, 1
Fold 2: 10, 3, 6, 4
Fold 3: 0, 11, 5, 8

- Split 1 -
Train: 10, 3, 6, 4, 0, 11, 5, 8
Test: 7, 2, 9, 1

- Split 2 -
Train: 7, 2, 9, 1, 0, 11, 5, 8
Test: 10, 3, 6, 4

- Split 3 -
Train: 10, 3, 6, 4, 7, 2, 9, 1
Test: 0, 11, 5, 8
```
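The illustration above can be reproduced with scikit-learn's `KFold`. This is a minimal sketch: `random_state=0` is an assumption added here for reproducibility, so the exact fold membership will differ from the hand-written example.

```python
import numpy as np
from sklearn.model_selection import KFold

# Same 12 samples as in the illustration above.
data = np.arange(12)

# random_state is an assumption, used only to make the shuffle repeatable.
kf = KFold(n_splits=3, shuffle=True, random_state=0)

# split() yields one (train indices, test indices) pair per iteration.
splits = list(kf.split(data))
for i, (train_idx, test_idx) in enumerate(splits, start=1):
    print(f"- Split {i} -")
    print(f"Train: {data[train_idx]}")
    print(f"Test: {data[test_idx]}")
```

Each sample lands in the test fold exactly once, and every split trains on the remaining `k-1` folds.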

`RepeatedKFold` repeats `KFold` `n` times, producing different splits in each repetition.
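As a rough sketch of what this looks like in code, the example below repeats a 3-fold split twice. The toy data, `n_repeats=2`, and `random_state=0` are illustrative assumptions.

```python
import numpy as np
from sklearn.model_selection import RepeatedKFold

data = np.arange(12)  # toy data, assumed for illustration

# 3 folds repeated 2 times gives 6 splits in total.
rkf = RepeatedKFold(n_splits=3, n_repeats=2, random_state=0)

repeated_splits = list(rkf.split(data))
for i, (train_idx, test_idx) in enumerate(repeated_splits):
    repetition, fold = divmod(i, 3)
    print(f"Repetition {repetition + 1}, fold {fold + 1}: test = {data[test_idx]}")
```

Within each repetition the test folds still partition the data; only the shuffling changes between repetitions.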

`StratifiedKFold` returns stratified folds, meaning each set contains approximately the same percentage of samples of each target class as the complete set. This matters because with plain `KFold`, some folds can contain a much larger or smaller share of a given target class, so some models would be trained or evaluated on an unrepresentative class balance.
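A small sketch of stratification on an imbalanced toy target follows. The data (8 samples of class 0 and 4 of class 1) and `n_splits=4` are assumptions chosen so the class ratio divides evenly across folds.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy imbalanced target: 8 samples of class 0, 4 samples of class 1.
y = np.array([0] * 8 + [1] * 4)
X = np.zeros((len(y), 1))  # placeholder features; only y drives stratification

skf = StratifiedKFold(n_splits=4)

# Count how many samples of each class end up in every test fold.
test_class_counts = []
for train_idx, test_idx in skf.split(X, y):
    counts = np.bincount(y[test_idx], minlength=2)
    test_class_counts.append(counts.tolist())
    print(f"Test fold class counts (class 0, class 1): {counts}")
```

Every test fold keeps the 2:1 class ratio of the full target, which plain `KFold` does not guarantee.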

`ShuffleSplit`: divides the data into a user-defined number of independent train/test splits. Samples are first shuffled, then split. Unlike `KFold`, the test sets of different splits can overlap.
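The sketch below shows `ShuffleSplit` on the same kind of toy data. The values `n_splits=3`, `test_size=0.25`, and `random_state=0` are illustrative assumptions.

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit

data = np.arange(12)  # toy data, assumed for illustration

# Each split is drawn independently: 25% of samples for testing, the rest for training.
ss = ShuffleSplit(n_splits=3, test_size=0.25, random_state=0)

shuffle_splits = list(ss.split(data))
for i, (train_idx, test_idx) in enumerate(shuffle_splits, start=1):
    print(f"Split {i}: train = {data[train_idx]}, test = {data[test_idx]}")
```

Because each split is sampled independently, a sample may appear in several test sets or in none, unlike the folds produced by `KFold`.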

### Cross-Validation Methods

`cross_val_score` returns one score per cross-validation split. By default, each score comes from the estimator's `score` method; this can be changed using the `scoring` parameter.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score

data, target = load_iris(return_X_y=True)

# Hold out a final test set; cross-validation only sees the training portion.
X_train, X_test, y_train, y_test = train_test_split(data, target, test_size=0.2)

# Fit and score a fresh model on each of the 5 splits.
scores = cross_val_score(
    LogisticRegression(),
    X_train,
    y_train,
    cv=5,
    scoring="accuracy"
)

print(f"Scores: {scores}")
```
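To make the mechanics concrete, the call above can be approximated by a manual loop. This is a sketch under stated assumptions: for a classifier and an integer `cv`, scikit-learn uses an unshuffled `StratifiedKFold`, which is what the loop mirrors. The `random_state=0` and `max_iter=1000` settings are additions made here for reproducibility and to avoid convergence warnings.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split

data, target = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    data, target, test_size=0.2, random_state=0
)

model = LogisticRegression(max_iter=1000)
cv = StratifiedKFold(n_splits=5)

# Manual version: fit a fresh copy of the model on each training subset
# and score it on the matching validation subset.
manual_scores = []
for train_idx, val_idx in cv.split(X_train, y_train):
    fold_model = clone(model)
    fold_model.fit(X_train[train_idx], y_train[train_idx])
    manual_scores.append(fold_model.score(X_train[val_idx], y_train[val_idx]))
manual_scores = np.array(manual_scores)

# Library version using the same splitter, for comparison.
library_scores = cross_val_score(model, X_train, y_train, cv=cv)

print(f"Manual scores:  {manual_scores}")
print(f"Library scores: {library_scores}")
```

Note that none of the fitted fold models is kept: the result is a set of scores, which matches the point that cross-validation evaluates a model rather than training one for later use.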

`cross_validate` enables the use of multiple evaluation metrics, and contains detailed information like fit times, score times, and the test score.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_validate

data, target = load_iris(return_X_y=True)

# Hold out a final test set; cross-validation only sees the training portion.
X_train, X_test, y_train, y_test = train_test_split(data, target, test_size=0.2)

# cross_validate returns a dict of arrays with one entry per split.
cv_results = cross_validate(
    LogisticRegression(),
    X_train,
    y_train,
    cv=5,
    scoring="accuracy"
)

# Single quotes inside the f-strings keep this valid on Python versions before 3.12.
print(f"Fit times: {cv_results['fit_time']}")
print(f"Score times: {cv_results['score_time']}")
print(f"Test scores: {cv_results['test_score']}")
```
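The multiple-metric use of `cross_validate` mentioned above can be sketched as follows. The choice of `"accuracy"` and `"f1_macro"` as metrics, along with `return_train_score=True`, `random_state=0`, and `max_iter=1000`, are illustrative assumptions rather than part of the original example.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_validate

data, target = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    data, target, test_size=0.2, random_state=0
)

# Passing a list of scorer names yields one "test_<name>" entry per metric.
multi_results = cross_validate(
    LogisticRegression(max_iter=1000),
    X_train,
    y_train,
    cv=5,
    scoring=["accuracy", "f1_macro"],
    return_train_score=True,  # also adds "train_<name>" entries
)

for key in sorted(multi_results):
    print(f"{key}: {multi_results[key]}")
```

With several metrics, the result keys are named after each scorer (for example `test_accuracy`) instead of the single `test_score` key.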