# Cross Validation
Cross validation is a `model validation` technique that tests how accurately a model predicts new data that was not used during training. It helps detect problems such as `overfitting` and `selection bias`.
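As a minimal sketch of why evaluation on unseen data matters, the snippet below compares accuracy on the training rows with accuracy on a single held-out split. The decision tree, the 25% split size, and the `random_state` values are assumptions chosen for illustration; they are not part of the cells that follow.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = load_breast_cancer(return_X_y=True)

# Hold out 25% of the rows; the model never sees them during fitting
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)

# An unconstrained tree is flexible enough to memorize its training data
tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)

train_acc = tree.score(X_tr, y_tr)  # accuracy on data the model was fit on
test_acc = tree.score(X_te, y_te)   # accuracy on data it has not seen
print(f"train accuracy:    {train_acc:.3f}")
print(f"held-out accuracy: {test_acc:.3f}")
```

A gap between the two numbers is the kind of optimism that a score computed only on training data hides. Cross validation repeats this hold-out idea over several splits so the estimate depends less on one particular split.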

In [35]:
from sklearn.datasets import load_breast_cancer

# Binary classification dataset: 569 samples, 30 numeric features
data = load_breast_cancer()
X, y = data["data"], data["target"]

#### K-Fold cross validation
It generally follows this structure:
<ol>
    <li>shuffle the data</li>
    <li>split the data into <b>k</b> groups of roughly equal size</li>
    <li>for each group
        <ol>
            <li>take the group as a <i>hold out</i></li>
            <li>take the remaining data as the training data</li>
            <li>fit a model on the training set, and evaluate it on the hold out group</li>
            <li>keep track of the score and discard the model</li>
        </ol>
    </li>
    <li>summarize the performance across all <b>k</b> folds</li>
</ol>
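The splitting steps above can be sketched in plain NumPy. The helper name `k_fold_indices` and its `seed` parameter are hypothetical and only illustrate the idea; this is not how scikit-learn implements `KFold` internally.

```python
import numpy as np

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, test_idx) pairs for k shuffled folds."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(n_samples)  # shuffle the row indices
    folds = np.array_split(indices, k)    # k groups of near-equal size
    for i in range(k):
        test_idx = folds[i]               # group i is the hold out
        # every other group becomes training data
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, test_idx

# 10 samples in 3 folds: one fold has 4 rows, the other two have 3
for train_idx, test_idx in k_fold_indices(n_samples=10, k=3):
    print(len(train_idx), len(test_idx))
```

Inside the loop, a model would be fit on `train_idx`, scored on `test_idx`, and the per-fold scores averaged at the end. Each row lands in exactly one hold out group, so every sample is used for evaluation once.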

In [44]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import KFold
import numpy as np

# 5 folds; KFold keeps the original row order unless shuffle=True is passed
kf = KFold(5)
print(f"number of splits: {kf.get_n_splits(X, y)}")

# max_iter must be an integer
clf = LogisticRegression(solver='lbfgs', max_iter=100_000)
scores = np.zeros(kf.get_n_splits(X, y))

for i, (train_index, test_index) in enumerate(kf.split(X)):
    X_train, y_train = X[train_index], y[train_index]
    X_test, y_test = X[test_index], y[test_index]

    # refit on the training folds, then score on the hold out fold
    clf.fit(X_train, y_train)
    preds = clf.predict(X_test)
    # scikit-learn metrics take (y_true, y_pred); this is recall on the hold out fold
    scores[i] = recall_score(y_test, preds)

print(scores.mean())

number of splits: 5
0.9729645951146966
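For comparison, scikit-learn's `cross_val_score` runs the same fit-and-score loop in one call. The sketch below is an assumption-laden variant rather than a drop-in replacement for the cell above: it standardizes the features in a pipeline and reports accuracy, so its numbers will differ from the output shown.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_cv, y_cv = load_breast_cancer(return_X_y=True)

# Assumption: scaling is added so that lbfgs converges in few iterations;
# inside a pipeline the scaler is fit on the training folds only
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# The estimator is cloned and refit for every split; one score per fold
fold_scores = cross_val_score(
    model, X_cv, y_cv, cv=KFold(n_splits=5), scoring="accuracy"
)
print(fold_scores)
print(fold_scores.mean())
```

Passing a splitter object as `cv` keeps the fold definition explicit, which makes it easy to swap in another splitter later.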


#### Stratified K-Fold cross validation
It's a variant of the `k-fold` method that returns `stratified` folds, meaning each fold preserves approximately the same proportion of each class as the full dataset, so every fold is representative of the entire population.
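To make the effect of stratification concrete, here is a small sketch on hypothetical toy labels (80 negatives followed by 20 positives). The arrays `X_toy` and `y_toy` and the helper `positive_rates` are made up for illustration only.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Hypothetical toy labels: 20% positives, sorted by class
y_toy = np.array([0] * 80 + [1] * 20)
X_toy = np.zeros((len(y_toy), 1))  # feature values do not affect the split

def positive_rates(splitter):
    """Fraction of positive labels in each test fold."""
    return [
        float(y_toy[test_idx].mean())
        for _, test_idx in splitter.split(X_toy, y_toy)
    ]

plain_rates = positive_rates(KFold(n_splits=5))
stratified_rates = positive_rates(StratifiedKFold(n_splits=5))
print("KFold:          ", plain_rates)
print("StratifiedKFold:", stratified_rates)
```

With these sorted labels, unshuffled `KFold` places every positive sample in the last fold, while `StratifiedKFold` keeps each fold at 20% positives. The difference matters most when classes are imbalanced or the rows are ordered by label.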

In [43]:
from sklearn.metrics import recall_score
from sklearn.model_selection import StratifiedKFold

# max_iter must be an integer
clf = LogisticRegression(solver='lbfgs', max_iter=100_000)

skf = StratifiedKFold(5)
print(f"number of splits: {skf.get_n_splits(X, y)}")

scores = np.zeros(skf.get_n_splits(X, y))

# stratification needs the labels, so y is passed to split()
for i, (train_index, test_index) in enumerate(skf.split(X, y)):
    X_train, y_train = X[train_index], y[train_index]
    X_test, y_test = X[test_index], y[test_index]
    clf.fit(X_train, y_train)
    preds = clf.predict(X_test)
    # scikit-learn metrics take (y_true, y_pred); this is recall on the hold out fold
    scores[i] = recall_score(y_test, preds)

print(scores.mean())

number of splits: 5
0.9719092331768389


### Resources: 
* [Wikipedia — Cross Validation](https://en.wikipedia.org/wiki/Cross-validation_(statistics))
* [ML Mastery — Gentle Introduction to K-Fold Cross Validation](https://machinelearningmastery.com/k-fold-cross-validation/)