### Cross Validation   
When adjusting models, we aim to improve performance on unseen data.  
Hyperparameter tuning can lead to much better scores on the test set.  
However, tuning hyperparameters against the test set leaks information about that set into the model, so the model can perform worse on truly unseen data.  
To correct for this, we can perform cross validation.  

There are many methods of cross validation.  
We will discuss _k-fold cross validation_ using a _DecisionTreeClassifier_.   

In [1]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import KFold, cross_val_score

In [2]:
from sklearn import datasets
X, y = datasets.load_iris(return_X_y=True)

#### K-Fold   
The training data is split into k smaller sets, called folds.  
The model is then trained on k-1 folds of the training set.  
The remaining fold is used as a validation set to evaluate the model, and the process repeats until each fold has served as the validation set once.  

General procedure:   
1. Shuffle the dataset randomly.
2. Split the dataset into k groups.
3. For each unique group:  
    - Take the group as a hold-out (test) data set  
    - Take the remaining groups as a training data set  
    - Fit a model on the training set and evaluate it on the test set  
    - Retain the evaluation score and discard the model  
4. Summarize the skill of the model using the sample of model evaluation scores.
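
The procedure above can be written out by hand to see what happens on each fold. The following is a minimal sketch, not the only way to do it: it assumes `shuffle=True` with an arbitrary `random_state=0` for the shuffling step, and the names `fold_scores` and `model` are chosen here for illustration.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier

X, y = datasets.load_iris(return_X_y=True)

# Steps 1 and 2: shuffle the rows and split them into k = 5 groups
# (random_state=0 is an arbitrary choice made for reproducibility)
k_folds = KFold(n_splits=5, shuffle=True, random_state=0)

fold_scores = []
for train_idx, test_idx in k_folds.split(X):
    # Step 3: hold out one group and train on the remaining k-1 groups
    model = DecisionTreeClassifier(random_state=42)
    model.fit(X[train_idx], y[train_idx])
    # Keep only the score; the model is replaced on the next iteration
    fold_scores.append(model.score(X[test_idx], y[test_idx]))

# Summarize the skill of the model across the folds
print("Fold scores:", np.round(fold_scores, 3))
print("Mean: %.3f, std: %.3f" % (np.mean(fold_scores), np.std(fold_scores)))
```

Reporting the standard deviation next to the mean gives a rough sense of how much the score depends on which rows end up in the validation fold.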

![](../Figures/cross_validate.png)

![](../Figures/cross_validate_.png)

In [3]:
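# Decision tree with a fixed seed so results are reproducible
clf = DecisionTreeClassifier(random_state=42)

# 5 folds; shuffle defaults to False, so folds follow the original row order
k_folds = KFold(n_splits=5)
# Fit and score the classifier once per fold
scores = cross_val_score(clf, X, y, cv=k_folds)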

In [4]:
from sklearn import datasets
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import KFold, cross_val_score
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn import metrics 

# Sample dataset
X, y = datasets.load_iris(return_X_y=True)

# Create the classifier; cross_val_score fits a fresh clone on each fold
clf = DecisionTreeClassifier(random_state=42)
k_folds = KFold(n_splits=5)
scores = cross_val_score(clf, X, y, cv=k_folds)

print("Cross Validation Scores: ", scores)
print("Average CV Score: ", scores.mean())
print("Number of CV Scores used in Average: ", len(scores))

Cross Validation Scores:  [1.         1.         0.83333333 0.93333333 0.8       ]
Average CV Score:  0.9133333333333333
Number of CV Scores used in Average:  5


In [5]:
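# Hold out 20% of the data as a single test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

# Fit on the training split and score on the held-out split
clf = clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print("Accuracy:", metrics.accuracy_score(y_test, y_pred))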

Accuracy: 0.9666666666666667
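
The hold-out accuracy above comes from one particular split, so it can shift when different rows land in the test set. As a rough sketch of that effect, the loop below repeats the same hold-out evaluation with several seeds. The seeds 0 to 4 and the names `holdout_scores`, `X_tr`, and `X_te` are assumptions made only for illustration.

```python
from sklearn import datasets, metrics
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = datasets.load_iris(return_X_y=True)

holdout_scores = []
for seed in range(5):  # arbitrary seeds, chosen only for illustration
    # Same 80/20 hold-out split as before, but with a different shuffle each time
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=seed)
    model = DecisionTreeClassifier(random_state=42).fit(X_tr, y_tr)
    holdout_scores.append(metrics.accuracy_score(y_te, model.predict(X_te)))

print("Hold-out accuracies:", holdout_scores)
print("Spread (max - min):", max(holdout_scores) - min(holdout_scores))
```

Averaging over folds, as k-fold cross validation does, is one way to reduce the dependence on a single split.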


Variations in cross-validation techniques:
1. K-Fold Cross-Validation
2. Stratified K-Fold Cross-Validation
3. Hold-Out based Validation
4. Leave-One-Out Cross-Validation
5. Group K-Fold Cross-Validation
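
Several of these variations are available as splitter classes in `sklearn.model_selection` and can be passed to `cross_val_score` through the `cv` argument. The sketch below compares a few of them on the iris data; hold-out validation is the single `train_test_split` used earlier, so it is left out. The `groups` array is hypothetical: iris has no natural groups, so ten are invented here purely to show how `GroupKFold` is called.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import (
    GroupKFold, KFold, LeaveOneOut, StratifiedKFold, cross_val_score,
)
from sklearn.tree import DecisionTreeClassifier

X, y = datasets.load_iris(return_X_y=True)
clf = DecisionTreeClassifier(random_state=42)

# Hypothetical group labels (assumption): 10 invented groups, for illustration only
groups = np.arange(len(y)) % 10

splitters = {
    "K-Fold (shuffled)": KFold(n_splits=5, shuffle=True, random_state=0),
    "Stratified K-Fold": StratifiedKFold(n_splits=5),  # keeps class ratios in each fold
    "Leave-One-Out": LeaveOneOut(),                    # one sample per validation set
    "Group K-Fold": GroupKFold(n_splits=5),            # keeps each group in one fold
}

results = {}
for name, cv in splitters.items():
    # Only Group K-Fold needs group labels
    cv_groups = groups if isinstance(cv, GroupKFold) else None
    results[name] = cross_val_score(clf, X, y, cv=cv, groups=cv_groups)
    print(f"{name}: mean={results[name].mean():.3f} over {len(results[name])} splits")
```

Leave-One-Out fits one model per sample, so it is practical here only because the dataset is small.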