# Cross-Validation with `sklearn`

The official scikit-learn [user guide](https://scikit-learn.org/stable/modules/cross_validation.html) presents cross-validation extensively, with many details and examples. Please consult it for more information.

## CV and Tuning for a Random Forest Classifier

In this example we will study a random forest classifier on synthetic data generated by scikit-learn's `make_classification` function. We will combine hyperparameter tuning with model selection using so-called **nested cross-validation**. This procedure lets each model select well-performing hyperparameters for the dataset, and it lets us select among a collection of well-configured models on that dataset.

> In order to overcome the bias in performance evaluation, model selection should be viewed as an integral part of the model fitting procedure, and should be conducted independently in each trial in order to prevent selection bias and because it reflects best practice in operational use. The model selection process is performed independently within each fold of the resampling procedure.

There is a price to pay. If a traditional cross-validated hyperparameter search fits and evaluates $n \times k$ models (with $n$ hyperparameter combinations and $k$ folds), nested cross-validation increases this to $k \times n \times k$ when the same $k$ is used for both loops, because the whole search is repeated once for each fold of the **outer loop**.

It is common to use $k=10$ for the outer loop and a smaller value of $k$ for the inner loop, such as $k=3$ or $k=5$.

Nested cross-validation (CV) is often used to train a model whose hyperparameters also need to be optimized. Nested CV estimates the generalization error of the underlying model and its (hyper)parameter search. Choosing the parameters that maximize the non-nested CV score biases the model to the dataset, yielding an overly optimistic score.

Model selection without nested CV uses the same data to tune model parameters and evaluate model performance. Information may thus “leak” into the model and overfit the data. The magnitude of this effect is primarily dependent on the size of the dataset and the stability of the model.
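To make this optimism concrete, here is a minimal sketch that computes both scores on a small synthetic dataset. The dataset size, the reduced grid, and the names `X_demo`, `search_demo`, `non_nested_score`, and `nested_score` are illustrative assumptions chosen for speed; they are not part of the example developed in this notebook. The non-nested score is often, though not always, the higher of the two.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score

# Small illustrative dataset (assumed sizes, chosen only to run quickly)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=5, n_redundant=2, random_state=0
)

# Reduced, assumed search grid
grid = {"n_estimators": [10, 50], "max_features": [2, 4]}
inner = KFold(n_splits=3, shuffle=True, random_state=0)
outer = KFold(n_splits=5, shuffle=True, random_state=0)

search_demo = GridSearchCV(
    RandomForestClassifier(random_state=0), grid, scoring="accuracy", cv=inner
)

# Non-nested: the same folds pick the hyperparameters AND produce the score
search_demo.fit(X_demo, y_demo)
non_nested_score = search_demo.best_score_

# Nested: each outer test fold is hidden from the search that is scored on it
nested_score = cross_val_score(
    search_demo, X_demo, y_demo, scoring="accuracy", cv=outer
).mean()

print(f"Non-nested: {non_nested_score:.3f}  nested: {nested_score:.3f}")
```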

To avoid this problem, nested CV effectively uses a series of train/validation/test set splits. 

- In the inner loop (executed by `GridSearchCV`), the score is approximately maximized by fitting a model to each training set, and then directly maximized in selecting (hyper)parameters over the validation set. 
- In the outer loop (here in `cross_val_score`), generalization error is estimated by averaging test set scores over several dataset splits.
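The two loops described above can also be written out by hand. The following is a sketch under assumed settings (a small dataset and a reduced grid chosen for speed; all variable names are illustrative), not the code that scikit-learn runs internally:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, KFold

# Small illustrative dataset (assumed sizes)
X_small, y_small = make_classification(
    n_samples=300, n_features=10, n_informative=5, n_redundant=2, random_state=1
)
param_grid = {"n_estimators": [10, 50], "max_features": [2, 4]}

outer_cv = KFold(n_splits=5, shuffle=True, random_state=1)
outer_scores = []

# OUTER loop: each pass holds out one test fold for the final score
for train_idx, test_idx in outer_cv.split(X_small):
    X_train, X_test = X_small[train_idx], X_small[test_idx]
    y_train, y_test = y_small[train_idx], y_small[test_idx]

    # INNER loop: the grid search splits the training part into
    # train/validation folds and picks the best hyperparameters
    inner_cv = KFold(n_splits=3, shuffle=True, random_state=1)
    fold_search = GridSearchCV(
        RandomForestClassifier(random_state=1),
        param_grid,
        scoring="accuracy",
        cv=inner_cv,
        refit=True,  # refit the best configuration on the whole training part
    )
    fold_search.fit(X_train, y_train)

    # Score the refitted best model on the untouched test fold
    y_pred = fold_search.best_estimator_.predict(X_test)
    outer_scores.append(accuracy_score(y_test, y_pred))

print(f"Accuracy: {np.mean(outer_scores):.3f} ({np.std(outer_scores):.3f})")
```

Because the test fold never enters `fold_search.fit`, the score recorded for each pass is not influenced by the hyperparameter choice made in that pass.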

The final model is configured and fit using the procedure applied in one pass of the outer loop, i.e. the inner hyperparameter search applied to the entire dataset.

This model can then be used to make predictions on **new data**. We know how well it will perform on average from the score estimated by the nested cross-validation procedure.
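A minimal sketch of this last step, assuming the same kind of grid search as in the inner loop: fit the search once on the full dataset, then predict with the refitted best estimator. The dataset, the reduced grid, and the names `X_all`, `final_search`, and `X_new` are illustrative assumptions; the first five rows of `X_all` merely stand in for new data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, KFold

# Small illustrative dataset (assumed sizes)
X_all, y_all = make_classification(
    n_samples=300, n_features=10, n_informative=5, n_redundant=2, random_state=1
)

final_search = GridSearchCV(
    RandomForestClassifier(random_state=1),
    {"n_estimators": [10, 50], "max_features": [2, 4]},
    scoring="accuracy",
    cv=KFold(n_splits=3, shuffle=True, random_state=1),
    refit=True,  # the best configuration is refit on all of X_all
)

# Apply the tuning procedure to the entire dataset
final_search.fit(X_all, y_all)
print("Selected hyperparameters:", final_search.best_params_)

# Placeholder for unseen samples; real new data would go here
X_new = X_all[:5]
predictions = final_search.predict(X_new)
print("Predictions:", predictions)
```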

**NOTE**

Thanks to scikit-learn, the resulting code is very compact and efficient, though we could write out each step explicitly if really necessary.


### Configuration

- We will tune two hyperparameters with three values each, i.e. 3 × 3 = 9 combinations.
- We will use 10 folds in the outer cross-validation and three folds in the inner cross-validation, resulting in 10 × 9 × 3 = 270 model evaluations.
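The arithmetic above can be checked with a few lines of plain Python. The variable names are illustrative. The last figure rests on one extra fact: with `refit=True`, `GridSearchCV` fits the best configuration once more on each outer training set, which adds 10 fits on top of the 270 evaluations.

```python
from itertools import product

# The search space and fold counts used in this configuration
space = {"n_estimators": [10, 100, 500], "max_features": [2, 4, 6]}
k_inner, k_outer = 3, 10

# Every combination of one value per hyperparameter
n_combinations = len(list(product(*space.values())))

fits_per_search = n_combinations * k_inner   # one inner grid search
search_fits = k_outer * fits_per_search      # repeated for each outer fold
total_fits = search_fits + k_outer           # plus one refit per outer fold

print(n_combinations, fits_per_search, search_fits, total_fits)
# → 9 27 270 280
```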



```python
# Automatic nested cross-validation for a random forest,
# performed on a synthetic classification dataset
from numpy import mean, std
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score

# Create the dataset: 1000 samples, 20 features (10 informative, 10 redundant)
X, y = make_classification(n_samples=1000, n_features=20, n_informative=10, n_redundant=10, random_state=1)

# Configure the inner cross-validation procedure (used for tuning)
cv_inner = KFold(n_splits=3, shuffle=True, random_state=1)

# Define the RF model
model = RandomForestClassifier(random_state=1)

# Define the grid for the search space
space = {
    'n_estimators': [10, 100, 500],
    'max_features': [2, 4, 6],
}

# Define the inner CV search - this is the INNER loop
search = GridSearchCV(model, space, scoring='accuracy', n_jobs=1, cv=cv_inner, refit=True)

# Configure the outer cross-validation procedure
cv_outer = KFold(n_splits=10, shuffle=True, random_state=1)

# Execute the nested cross-validation - this is the OUTER loop
scores = cross_val_score(search, X, y, scoring='accuracy', cv=cv_outer, n_jobs=-1)

# Report the final performance: mean accuracy (standard deviation)
print('Accuracy: %.3f (%.3f)' % (mean(scores), std(scores)))
```

Accuracy: 0.929 (0.020)
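`cross_val_score` returns only the outer scores. If the hyperparameters selected in each outer fold are also of interest, one option is `cross_validate` with `return_estimator=True`, which keeps the fitted search object of every fold. The sketch below uses a smaller, assumed dataset and grid so that it runs quickly; the names `X_cv`, `cv_results`, and `chosen` are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, KFold, cross_validate

# Small illustrative dataset (assumed sizes)
X_cv, y_cv = make_classification(
    n_samples=300, n_features=10, n_informative=5, n_redundant=2, random_state=1
)

inner_search = GridSearchCV(
    RandomForestClassifier(random_state=1),
    {"n_estimators": [10, 50], "max_features": [2, 4]},
    scoring="accuracy",
    cv=KFold(n_splits=3, shuffle=True, random_state=1),
)

# Outer loop that also returns the fitted GridSearchCV of each fold
cv_results = cross_validate(
    inner_search,
    X_cv,
    y_cv,
    scoring="accuracy",
    cv=KFold(n_splits=5, shuffle=True, random_state=1),
    return_estimator=True,
)

# Collect and show what each outer fold selected, with its test score
chosen = [est.best_params_ for est in cv_results["estimator"]]
for fold, (params, score) in enumerate(zip(chosen, cv_results["test_score"])):
    print(f"fold {fold}: {params} -> accuracy {score:.3f}")
```

If the selected hyperparameters vary a lot from fold to fold, that suggests the search is sensitive to the particular training split.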
