# Introduction
<hr style="border:2px solid black"> </hr>


**What?** Nested cross-validation



# Import modules
<hr style="border:2px solid black"> </hr>

In [2]:
from numpy import mean
from numpy import std
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score

# Create dataset
<hr style="border:2px solid black"> </hr>

In [None]:
# create synthetic dataset
X, y = make_classification(n_samples=1000, n_features=20,
                           random_state=1, n_informative=10, n_redundant=10)

# Non-nested vs. nested CVs
<hr style="border:2px solid black"> </hr>


- **NON-NESTED** = model selection without nested CV uses the same data to tune model (hyper)parameters and to evaluate model performance.
Information may thus “leak” into the model and overfit the data. The magnitude of this effect depends primarily
on the size of the dataset and the stability of the model. Choosing the parameters that maximize the
non-nested CV score biases the model to the dataset, yielding an overly optimistic score.


- **NESTED** = estimates the generalization error of the underlying model and its (hyper)parameter search.
Cross-validation (CV) is often used to train a model whose hyperparameters also need to be optimized. To avoid
the leakage that occurs when the same data is used for tuning and evaluation, nested CV effectively uses a series of train/validation/test set splits. In the inner loop
(here executed by `GridSearchCV`), the score is approximately maximized by fitting a model to each training set, and
then directly maximized in selecting (hyper)parameters over the validation set. In the outer loop (here in
`cross_val_score`), generalization error is estimated by averaging test set scores over several dataset splits.
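
The difference between the two procedures can be sketched on a small example. This is a minimal illustration rather than part of the original analysis: the dataset size, the `DecisionTreeClassifier` model, the `max_depth` grid and all variable names (`X_demo`, `param_grid`, `inner_cv`, `outer_cv`, ...) are assumptions chosen only to keep it fast. The non-nested score is often, but not always, the higher of the two.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Small synthetic dataset (sizes are assumptions, chosen to keep the sketch fast)
X_demo, y_demo = make_classification(n_samples=200, n_features=10,
                                     n_informative=5, n_redundant=2,
                                     random_state=1)

# Hypothetical search space, for illustration only
param_grid = {'max_depth': [2, 4, 8]}

inner_cv = KFold(n_splits=3, shuffle=True, random_state=1)
outer_cv = KFold(n_splits=5, shuffle=True, random_state=1)

search_demo = GridSearchCV(DecisionTreeClassifier(random_state=1),
                           param_grid, scoring='accuracy', cv=inner_cv)

# Non-nested: the same folds select the hyperparameters and produce the score
search_demo.fit(X_demo, y_demo)
non_nested_score = search_demo.best_score_

# Nested: the search is re-run on each outer training split and the
# refit best model is scored on the held-out outer test split
nested_scores = cross_val_score(search_demo, X_demo, y_demo,
                                scoring='accuracy', cv=outer_cv)
nested_score = nested_scores.mean()

print('Non-nested CV accuracy: %.3f' % non_nested_score)
print('Nested CV accuracy:     %.3f' % nested_score)
```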



# Manual nested CV
<hr style="border:2px solid black"> </hr>


- Manual nested cross-validation for random forest on a classification dataset.

- Importantly, we can configure the hyperparameter search to refit a final model on the entire training
dataset using the best hyperparameters found during the search. This is achieved by setting `refit=True` and then retrieving the model via the `best_estimator_` attribute of the search result.

- We will keep things simple and tune just two hyperparameters with three values each, i.e. 3 * 3 = 9
combinations. We will use 10 folds in the outer cross-validation and three folds in the inner cross-validation,
resulting in 10 * 9 * 3 = 270 model evaluations.
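
The count above can be checked with a few lines of arithmetic. This is a small sketch; the names `space_demo`, `n_outer` and `n_inner` are introduced here for illustration only. It also counts the extra fit per outer fold that `refit=True` triggers, which the figure of 270 does not include.

```python
from sklearn.model_selection import ParameterGrid

# Same search space and fold counts as described above
space_demo = {'n_estimators': [10, 100, 500], 'max_features': [2, 4, 6]}
n_outer, n_inner = 10, 3

# ParameterGrid enumerates every combination in the grid
n_combinations = len(ParameterGrid(space_demo))

# Fits performed by the inner searches across all outer folds
search_fits = n_outer * n_combinations * n_inner

# With refit=True, each outer fold adds one fit of the best configuration
refit_fits = n_outer

print(n_combinations, search_fits, search_fits + refit_fits)  # 9 270 280
```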



In [None]:
# configure the cross-validation procedure
cv_outer = KFold(n_splits=10, shuffle=True, random_state=1)
# enumerate splits
outer_results = list()
for train_ix, test_ix in cv_outer.split(X):
    # split data
    X_train, X_test = X[train_ix, :], X[test_ix, :]
    y_train, y_test = y[train_ix], y[test_ix]
    # configure the cross-validation procedure
    cv_inner = KFold(n_splits=3, shuffle=True, random_state=1)
    # define the model
    model = RandomForestClassifier(random_state=1)
    # define search space
    space = dict()
    space['n_estimators'] = [10, 100, 500]
    space['max_features'] = [2, 4, 6]
    # define search
    search = GridSearchCV(model, space, scoring='accuracy',
                          cv=cv_inner, refit=True)
    # execute search
    result = search.fit(X_train, y_train)
    # get the best performing model fit on the whole training set
    best_model = result.best_estimator_
    # evaluate model on the hold out dataset
    yhat = best_model.predict(X_test)
    # evaluate the model
    acc = accuracy_score(y_test, yhat)
    # store the result
    outer_results.append(acc)
    # report progress
    print('>acc=%.3f, est=%.3f, cfg=%s' %
          (acc, result.best_score_, result.best_params_))
# summarize the estimated performance of the model
print('Accuracy: %.3f (%.3f)' % (mean(outer_results), std(outer_results)))

>acc=0.900, est=0.932, cfg={'max_features': 4, 'n_estimators': 100}
>acc=0.940, est=0.924, cfg={'max_features': 4, 'n_estimators': 500}
>acc=0.930, est=0.929, cfg={'max_features': 4, 'n_estimators': 500}
>acc=0.930, est=0.927, cfg={'max_features': 6, 'n_estimators': 100}
>acc=0.920, est=0.927, cfg={'max_features': 4, 'n_estimators': 100}
>acc=0.950, est=0.927, cfg={'max_features': 4, 'n_estimators': 500}
>acc=0.910, est=0.918, cfg={'max_features': 2, 'n_estimators': 100}
>acc=0.930, est=0.924, cfg={'max_features': 6, 'n_estimators': 500}


# Automated nested CV
<hr style="border:2px solid black"> </hr>


- Automatic nested cross-validation for random forest on a classification dataset.
- A simpler way to perform the same procedure is to let the `cross_val_score()` function execute the outer cross-validation loop.
- `cross_val_score()` can be applied directly to the configured `GridSearchCV` object: in each outer fold the search runs on the training split, and the refit best-performing model is then evaluated on that fold's test split. This greatly
reduces the amount of code required to perform nested cross-validation.



In [None]:
# create dataset
X, y = make_classification(n_samples=1000, n_features=20,
                           random_state=1, n_informative=10, n_redundant=10)
# configure the cross-validation procedure
cv_inner = KFold(n_splits=3, shuffle=True, random_state=1)
# define the model
model = RandomForestClassifier(random_state=1)
# define search space
space = dict()
space['n_estimators'] = [10, 100, 500]
space['max_features'] = [2, 4, 6]
# define search
search = GridSearchCV(model, space, scoring='accuracy',
                      n_jobs=1, cv=cv_inner, refit=True)
# configure the cross-validation procedure
cv_outer = KFold(n_splits=10, shuffle=True, random_state=1)
# execute the nested cross-validation
scores = cross_val_score(
    search, X, y, scoring='accuracy', cv=cv_outer, n_jobs=-1)
# report performance
print('Accuracy: %.3f (%.3f)' % (mean(scores), std(scores)))

# Conclusions
<hr style="border:2px solid black"> </hr>



- We can see that, in each outer fold, the inner-loop estimate (`est`) and the hold-out accuracy (`acc`) are different, but similar.
- We can also see that different hyperparameters are found on each iteration, showing how the best hyperparameters can depend on the specifics of the dataset.
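
The per-fold hyperparameters are printed by the manual loop but hidden by `cross_val_score()`. One possible way to recover them in the automated version is scikit-learn's `cross_validate()` with `return_estimator=True`, which keeps the fitted search object from each outer fold. The sketch below is not part of the original notebook; it uses a smaller dataset, a smaller grid and fewer folds (all assumptions, as are the variable names) so that it runs quickly.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, KFold, cross_validate

# Smaller dataset and grid than in the notebook, only to keep the sketch fast
X_small, y_small = make_classification(n_samples=200, n_features=20,
                                       n_informative=10, n_redundant=10,
                                       random_state=1)
small_space = {'n_estimators': [10, 50], 'max_features': [2, 4]}

inner = KFold(n_splits=3, shuffle=True, random_state=1)
outer = KFold(n_splits=5, shuffle=True, random_state=1)
small_search = GridSearchCV(RandomForestClassifier(random_state=1), small_space,
                            scoring='accuracy', cv=inner, refit=True)

# return_estimator=True keeps the fitted GridSearchCV from each outer fold
cv_results = cross_validate(small_search, X_small, y_small, scoring='accuracy',
                            cv=outer, return_estimator=True)

# Report outer accuracy, inner estimate and chosen configuration per fold
for acc, fitted in zip(cv_results['test_score'], cv_results['estimator']):
    print('>acc=%.3f, est=%.3f, cfg=%s' %
          (acc, fitted.best_score_, fitted.best_params_))
```

Each entry of `cv_results['estimator']` is a fitted `GridSearchCV`, so `best_params_` and `best_score_` are available per fold in the same way as in the manual loop.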




# References
<hr style="border:2px solid black"> </hr>


- https://machinelearningmastery.com/nested-cross-validation-for-machine-learning-with-python/

