# Cross Validation
## The difference between a good model and an overfit one!
This notebook accompanies a YouTube video on the topic from Rob's YouTube channel. [Check out the channel and video here.](https://www.youtube.com/channel/UCxladMszXan-jfgzyeIMyvw)

![notsure](resources/shutterstock_493554238.jpg)


We will:
1. Walk through an example where skipping cross validation leads us to believe our model is better than it is.
2. Show the different cross validation techniques and when to apply each.
3. Rerun our example from part 1 using the correct cross validation technique.

In [90]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
from sklearn.svm import SVC

from sklearn.metrics import accuracy_score, roc_auc_score
from typing import List

from sklearn.model_selection import (
    train_test_split,
    TimeSeriesSplit,
    KFold,
    StratifiedKFold,
    GroupKFold,
    StratifiedGroupKFold
)
plt.style.use('seaborn-white')

Note: `StratifiedGroupKFold` was added in scikit-learn 1.0. On older versions the import above fails with `ImportError: cannot import name 'StratifiedGroupKFold' from 'sklearn.model_selection'`. Upgrading scikit-learn resolves it.

# The Dataset

Stroke prediction data: using information about patients, predict whether they are likely to have a stroke.
- Features include gender, marital status, smoking status, age, etc.
- A randomly assigned "doctor" feature is added to represent the "group" within the data.

In [None]:
def get_prep_data():
    df = pd.read_csv(
        "./resources/healthcare-dataset-stroke-data.csv"
    )
    # Map the Yes/No strings to booleans
    df["ever_married"] = df["ever_married"].map({"Yes": True, "No": False})
    df["gender"] = df["gender"].astype("category")
    df["smoking_status"] = df["smoking_status"].astype("category")
    df["Residence_type"] = df["Residence_type"].astype("category")
    df["work_type"] = df["work_type"].astype("category")
    df["doctor"] = np.random.randint(0, 8, size=len(df))
    
    return df


df = get_prep_data()

In [None]:
df.head()

In [None]:
from sklearn.preprocessing import LabelEncoder

In [None]:
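def encode_features(df: pd.DataFrame, features: List[str]) -> None:
    """Label-encode the given columns of df in place."""
    le = LabelEncoder()
    for feat in features:
        # fit_transform refits the encoder for each column
        df[feat] = le.fit_transform(df[feat])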

In [None]:
encode_features(df, ['gender', 'work_type', 'Residence_type', 'smoking_status'])

In [None]:
df.head()

In [None]:
df.isna().any()

In [None]:
# Assign the result back instead of calling fillna in place on a column selection
df['bmi'] = df['bmi'].fillna(-1)
df.head()

In [None]:
def split_data(df):
    holdout_ids = df.sample(n=500, random_state=529).index
    
    train = (
        df.loc[~df.index.isin(holdout_ids)]
        .sample(frac=1, random_state=529)
        .sort_values("doctor")
        .reset_index(drop=True)
    )
    holdout = (
        df.loc[df.index.isin(holdout_ids)]
        .sample(frac=1, random_state=529)
        .sort_values("doctor")
        .reset_index(drop=True)
    )

    return train, holdout

train, holdout = split_data(df)

In [None]:
def get_X_y(train):
    FEATURES = [
        "gender",
        "age",
        "hypertension",
        "heart_disease",
        "ever_married",
        "work_type",
        "Residence_type",
        "avg_glucose_level",
        "bmi",
        "smoking_status",
    ]

    GROUPS = "doctor"

    TARGET = "stroke"

    X = train[FEATURES]
    y = train[TARGET]
    groups = train[GROUPS]
    return X, y, groups

In [None]:
X, y, groups = get_X_y(train)
clf = SVC()
clf.fit(X, y)

In [None]:

# Predict on training set
pred = clf.predict(X)
# pred_prob = clf.predict_proba(X)[:, 1]

acc_score = accuracy_score(y, pred)
# auc_score = roc_auc_score(y, pred_prob)

print(f'The score on the training set is accuracy: {acc_score:0.4f}')

# Model can predict with 99% accuracy!!!
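- NOPE! This score comes from the same data the model was trained on, so it tells us little about performance on unseen data.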
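
To see why a training-set score can't be trusted, here is a minimal sketch on pure noise. It swaps in a `DecisionTreeClassifier` (not the `SVC` used in this notebook) because an unpruned tree memorizes its training data, and the data is synthetic, so all names below are illustrative only.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier

# Random features and random labels: there is nothing real to learn here.
rng = np.random.default_rng(0)
X_noise = rng.normal(size=(400, 5))
y_noise = rng.integers(0, 2, size=400)

# Fit on the first half, keep the second half unseen.
tree = DecisionTreeClassifier(random_state=0)
tree.fit(X_noise[:200], y_noise[:200])

train_acc = accuracy_score(y_noise[:200], tree.predict(X_noise[:200]))
test_acc = accuracy_score(y_noise[200:], tree.predict(X_noise[200:]))
print(f"train accuracy: {train_acc:0.4f}, unseen accuracy: {test_acc:0.4f}")
```

The tree scores perfectly on the rows it memorized, while the unseen half should land near a coin flip.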

# Check on a holdout set

In [None]:
X_holdout, y_holdout, groups_holdout = get_X_y(holdout)

pred = clf.predict(X_holdout)
# pred_prob = clf.predict_proba(X_holdout)[:, 1]
acc_score = accuracy_score(y_holdout, pred)
# auc_score = roc_auc_score(y_holdout, pred_prob)
print(
    f"Our accuracy on the holdout set is {acc_score:0.4f}"
)

## Baseline
Predicting all zeros

In [None]:
acc_score = accuracy_score(y_holdout, np.zeros_like(y_holdout))
auc_score = roc_auc_score(y_holdout, np.zeros_like(y_holdout))
print(
    f"Our baseline on the holdout set is {acc_score:0.4f} and AUC is {auc_score:0.4f}"
)

# Train/Test Split


![traintest](https://www.researchgate.net/profile/Brian-Mwandau/publication/325870973/figure/fig6/AS:639531594285060@1529487622235/Train-Test-Data-Split.png)

Split the training data into a training and validation set. Train the model on the training set, and validate it on the validation set.
- The most basic way of splitting data.
- `shuffle`: usually a good idea, so that the order of the rows doesn't influence the split.
- `stratify`: keeps the proportion of positive samples roughly equal in each set. Consider it when the dataset is small or imbalanced.
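
A small sketch of a shuffled, stratified split, using synthetic labels with 5% positives (an assumption for illustration, not the stroke data):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced labels: 50 positives out of 1000 samples.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 3))
y_demo = np.zeros(1000, dtype=int)
y_demo[:50] = 1

# stratify=y_demo keeps the positive rate the same in both sets.
X_tr_demo, X_val_demo, y_tr_demo, y_val_demo = train_test_split(
    X_demo, y_demo, test_size=0.1, shuffle=True, stratify=y_demo, random_state=0
)
print(
    f"positive rate: train={y_tr_demo.mean():0.3f}, "
    f"validation={y_val_demo.mean():0.3f}"
)
```

Without `stratify`, the 100-row validation set could easily end up with very few positives, or none at all.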

In [None]:
X, y, groups = get_X_y(train)

X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.1)
clf = SVC()
clf.fit(X_tr, y_tr)
pred = clf.predict(X_val)
acc_score = accuracy_score(y_val, pred)
print(
    f"Our accuracy on the validation set is {acc_score:0.4f}"
)

# Cross Validation!

Visualizations adapted from [here](https://scikit-learn.org/stable/auto_examples/model_selection/plot_cv_indices.html#sphx-glr-auto-examples-model-selection-plot-cv-indices-py)

In [None]:
from matplotlib.patches import Patch
cmap_data = plt.cm.Paired
cmap_cv = plt.cm.coolwarm

def visualize_groups(classes, groups, name):
    # Visualize dataset groups
    fig, ax = plt.subplots()
    ax.scatter(
        range(len(groups)),
        [0.5] * len(groups),
        c=groups,
        marker="_",
        lw=50,
        cmap=cmap_data,
    )
    ax.scatter(
        range(len(groups)),
        [3.5] * len(groups),
        c=classes,
        marker="_",
        lw=50,
        cmap=cmap_data,
    )
    ax.set(
        ylim=[-1, 5],
        yticks=[0.5, 3.5],
        yticklabels=["Data\ngroup", "Data\nclass"],
        xlabel="Sample index",
    )


def plot_cv_indices(cv, X, y, group, ax, n_splits, lw=25):
    """Create a sample plot for indices of a cross-validation object."""

    # Generate the training/testing visualizations for each CV split
    for ii, (tr, tt) in enumerate(cv.split(X=X, y=y, groups=group)):
        # Fill in indices with the training/test groups
        indices = np.array([np.nan] * len(X))
        indices[tt] = 1
        indices[tr] = 0

        # Visualize the results
        ax.scatter(
            range(len(indices)),
            [ii + 0.5] * len(indices),
            c=indices,
            marker="_",
            lw=lw,
            cmap=cmap_cv,
            vmin=-0.2,
            vmax=1.2,
        )

    # Plot the data classes and groups at the end
    ax.scatter(
        range(len(X)), [ii + 1.5] * len(X), c=y, marker="_", lw=lw, cmap=cmap_data
    )

    ax.scatter(
        range(len(X)), [ii + 2.5] * len(X), c=group, marker="_", lw=lw, cmap=cmap_data
    )

    # Formatting
    yticklabels = list(range(n_splits)) + ["class", "group"]
    ax.set(
        yticks=np.arange(n_splits + 2) + 0.5,
        yticklabels=yticklabels,
        xlabel="Sample index",
        ylabel="CV iteration",
        ylim=[n_splits + 2.2, -0.2],
        xlim=[0, 100],
    )
    ax.set_title("{}".format(type(cv).__name__), fontsize=15)
    return ax


def plot_cv(cv, X, y, groups, n_splits=5):
    this_cv = cv(n_splits=n_splits)
    fig, ax = plt.subplots(figsize=(15, 5))
    plot_cv_indices(this_cv, X, y, groups, ax, n_splits)

    ax.legend(
        [Patch(color=cmap_cv(0.8)), Patch(color=cmap_cv(0.02))],
        ["Testing set", "Training set"],
        loc=(1.02, 0.8),
    )
    plt.tight_layout()
    fig.subplots_adjust(right=0.7)
    plt.show()
    
def get_fake_X_y():
    # Generate fake class/group data for the examples
    n_points = 100
    X_ = np.random.randn(n_points, 10)

    # 10% of samples in class 0, 90% in class 1
    percentiles_classes = [0.1, 0.9]
    y_ = np.hstack([[ii] * int(n_points * perc) for ii, perc in enumerate(percentiles_classes)])

    # Ten evenly sized groups of ten consecutive samples
    groups_ = np.hstack([[ii] * 10 for ii in range(10)])
    return X_, y_, groups_

# KFold
- Split dataset into k consecutive folds (without shuffling by default).
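
As a quick sketch of what "consecutive folds" means, here are the validation indices `KFold` produces for ten toy samples (the array is made up for illustration):

```python
import numpy as np
from sklearn.model_selection import KFold

X_demo = np.arange(10).reshape(-1, 1)

# Default KFold does not shuffle, so each validation fold is a consecutive block.
kf_demo = KFold(n_splits=5)
kf_folds = [(tr.tolist(), val.tolist()) for tr, val in kf_demo.split(X_demo)]
for i, (tr, val) in enumerate(kf_folds):
    print(f"fold {i}: validation indices = {val}")
```

Passing `shuffle=True` (optionally with a `random_state`) randomizes which samples land in each fold.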

In [None]:
kf = KFold()
X_, y_, groups_ = get_fake_X_y()
plot_cv(KFold, X_, y_, groups_)

# Stratified KFold
- KFold but the folds are made by preserving the percentage of samples for each class.
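
A minimal comparison, assuming synthetic labels with all 10 positives at the start of the data (a deliberately bad ordering, not the stroke data):

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# 10 positives followed by 90 negatives.
y_demo = np.array([1] * 10 + [0] * 90)
X_demo = np.zeros((100, 1))


def positives_per_fold(cv):
    """Count positive samples in each validation fold."""
    return [int(y_demo[val].sum()) for _, val in cv.split(X_demo, y_demo)]


kfold_counts = positives_per_fold(KFold(n_splits=5))
strat_counts = positives_per_fold(StratifiedKFold(n_splits=5))
print("KFold:          ", kfold_counts)
print("StratifiedKFold:", strat_counts)
```

Plain `KFold` puts every positive in the first validation fold, while `StratifiedKFold` spreads them evenly.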

In [None]:
skf = StratifiedKFold()
X_, y_, groups_ = get_fake_X_y()
plot_cv(StratifiedKFold, X_, y_, groups_)

# Group KFold

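K-fold variant with non-overlapping groups: the same group never appears in both the training and validation set of a split. The folds are approximately balanced in the sense that the number of distinct groups is approximately the same in each fold.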
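
A short sketch checking that groups don't leak across a split, using four hypothetical groups (think of them as doctors) with three samples each:

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Synthetic data: 12 samples from 4 made-up groups.
X_demo = np.zeros((12, 1))
y_demo = np.zeros(12, dtype=int)
groups_demo = np.repeat([0, 1, 2, 3], 3)

overlaps = []
for tr, val in GroupKFold(n_splits=4).split(X_demo, y_demo, groups_demo):
    # Groups present on both sides of the split (should be none).
    shared = set(groups_demo[tr].tolist()) & set(groups_demo[val].tolist())
    overlaps.append(shared)
    val_groups = sorted(set(groups_demo[val].tolist()))
    print(f"validation groups: {val_groups}, shared with train: {len(shared)}")
```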

In [None]:
gkf = GroupKFold()
X_, y_, groups_ = get_fake_X_y()
plot_cv(GroupKFold, X_, y_, groups_)

In [None]:
plot_cv(GroupKFold, X_, y_, groups_)

# Stratified Group KFold

The difference between GroupKFold and StratifiedGroupKFold is that the former attempts to create balanced folds such that the number of distinct groups is approximately the same in each fold, whereas StratifiedGroupKFold attempts to create folds which preserve the percentage of samples for each class as much as possible given the constraint of non-overlapping groups between splits.

In [None]:
# Requires scikit-learn >= 1.0
sgkf = StratifiedGroupKFold()
X_, y_, groups_ = get_fake_X_y()
np.random.shuffle(y_)
plot_cv(StratifiedGroupKFold, X_, y_, groups_)

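# Time Series Split
- Each split trains on earlier samples and validates on the samples that follow, so the model is never evaluated on data that precedes its training data.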

In [None]:
tss = TimeSeriesSplit()
X_, y_, groups_ = get_fake_X_y()
np.random.shuffle(y_)
plot_cv(TimeSeriesSplit, X_, y_, groups_)
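
A minimal sketch of how `TimeSeriesSplit` orders its folds, on twelve toy samples that are assumed to be sorted by time:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Twelve samples, assumed to be in chronological order.
X_demo = np.arange(12).reshape(-1, 1)

ts_folds = []
for tr, val in TimeSeriesSplit(n_splits=3).split(X_demo):
    ts_folds.append((tr.tolist(), val.tolist()))
    print(f"train on indices 0..{tr.max()}, validate on {val.tolist()}")
```

The training window grows with each split, and the validation block always comes after it.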

# Our Example Using Proper Cross Validation
1. Small/imbalanced -> Stratified
2. Group
3. Shuffle is on

`StratifiedGroupKFold` is a good choice for this situation.

In [None]:
# Stratified and group-aware splitter (requires scikit-learn >= 1.0)
sgk = StratifiedGroupKFold(n_splits=5, shuffle=True, random_state=529)

X, y, groups = get_X_y(train)

fold = 0
accs = []
for train_idx, val_idx in sgk.split(X, y, groups):
    # split() yields positional indices, so select rows with iloc
    X_tr = X.iloc[train_idx]
    y_tr = y.iloc[train_idx]

    X_val = X.iloc[val_idx]
    y_val = y.iloc[val_idx]

    # Fit the model on the training fold, score it on the validation fold
    clf = SVC()
    clf.fit(X_tr, y_tr)
    pred = clf.predict(X_val)
    acc_score = accuracy_score(y_val, pred)
    print(f"======= Fold {fold} ========")
    print(
        f"Our accuracy on the validation set is {acc_score:0.4f}"
    )
    fold += 1
    accs.append(acc_score)
oof_acc = np.mean(accs)
print(f'Our out of fold accuracy score is {oof_acc:0.4f}')

Our averaged out-of-fold score is a much better estimate of how our model will perform on unseen data.