# Overview

In the first sprint for this unit, we introduced the concept of using a validation set. This method provides a way to check or evaluate your model before needing to use a final test set. As a test set isn't always available (such as for a Kaggle competition), validation sets fill in the gap.

But there are some limitations to using a single validation set. One important consideration is that you will get different results (model scores) with different validation sets. A way around this is to use more than one validation set. There are a few different ways to do this.
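The variation between validation sets is easy to see in practice. The following is a minimal sketch, assuming scikit-learn's built-in digits dataset and a standardize-then-logistic-regression model; the 80/20 split and the five seeds are arbitrary choices for illustration.

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_digits(return_X_y=True)

val_scores = []
for seed in range(5):
    # A different random_state produces a different train/validation split
    X_train, X_val, y_train, y_val = train_test_split(
        X, y, test_size=0.2, random_state=seed
    )
    model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    model.fit(X_train, y_train)
    # Accuracy on the held-out validation rows for this particular split
    val_scores.append(model.score(X_val, y_val))

print('Validation accuracy per split: ', val_scores)
print('Spread (max - min): ', max(val_scores) - min(val_scores))
```

The same model and the same data give a slightly different score for each split, which is why a single validation score can be misleading.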

### Cross-validation

In cross-validation, we divide the data into equal-sized sets; some sets are used for training and the remaining set is used for testing. For each trial, a different set is used for testing. The specific method we will discuss in this objective is called *k-fold cross-validation*.

#### K-fold Cross-Validation

For this method, we divide the data into *k* (approximately) equal sets or *folds*; a typical number is 5 or 10 folds. In each trial, we train on k-1 folds and test on the remaining fold, choosing a different fold for testing each time. For example, if k=5, the first trial trains on four folds and tests on the fifth; the next trial holds out a different fold, and so on. For k=5, there will be five trials with five different test sets and an accuracy score for each.
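The fold assignment can be made concrete with a toy example. This is a small sketch using scikit-learn's `KFold` on ten made-up samples; the array `data` is purely illustrative.

```python
import numpy as np
from sklearn.model_selection import KFold

data = np.arange(10)  # ten toy samples, labelled 0..9

# Without shuffling, each fold is a contiguous block of two samples
kfold = KFold(n_splits=5)
folds = list(kfold.split(data))

for trial, (train_idx, test_idx) in enumerate(folds, start=1):
    print(f'Trial {trial}: train on {train_idx}, test on {test_idx}')

# First line printed:
# Trial 1: train on [2 3 4 5 6 7 8 9], test on [0 1]
```

Each sample lands in the test fold exactly once across the five trials.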

What we didn't cover as much was *why* we use cross-validation and how it can be used to select the hyperparameters which result in the best model.
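One way to picture hyperparameter selection is to compute the mean cross-validation score for each candidate value and keep the best one. The sketch below is an assumption-laden illustration: the k-nearest-neighbors classifier, the candidate `n_neighbors` values, and the digits dataset are chosen for convenience and are not prescribed by this lesson.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_digits(return_X_y=True)
cv = KFold(n_splits=5, shuffle=True, random_state=0)

# Candidate hyperparameter values (illustrative only)
candidates = [1, 3, 5, 9]

mean_scores = {}
for n_neighbors in candidates:
    model = KNeighborsClassifier(n_neighbors=n_neighbors)
    # Five accuracy scores, one per fold, for this candidate value
    scores = cross_val_score(model, X, y, cv=cv, scoring='accuracy')
    mean_scores[n_neighbors] = scores.mean()

# Keep the candidate with the highest mean cross-validation accuracy
best_n_neighbors = max(mean_scores, key=mean_scores.get)
print('Mean cv accuracy per candidate: ', mean_scores)
print('Best n_neighbors: ', best_n_neighbors)
```

Because every candidate is scored on the same five folds, the comparison does not depend on one lucky or unlucky validation split.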

### Cross-validation and Pipelines

As introduced in lecture, we could hold out a validation set by separating the data into training, validation, and test sets. Train the model on the training set, check the model and its hyperparameters against the validation set, and then do the final evaluation on the test set.
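A three-way split like this can be sketched with two calls to `train_test_split`. The proportions below (roughly 60/20/20) and the digits dataset are assumptions made for illustration.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)

# Step 1: set aside 20% of the data as the final test set
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Step 2: take 25% of what remains (about 20% of the total) for validation
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, random_state=42
)

print('Train / validation / test sizes: ',
      len(X_train), len(X_val), len(X_test))
```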

One issue with this method is that if you need to standardize your variables, and you standardize *before* splitting into training and testing sets, you will inadvertently leak information from the testing set into training. For example, if you standardize by subtracting the mean and dividing by the standard deviation, those statistics are computed with the test rows included, so your training data will *know* something about the test data.

For cross-validation, the same thing happens if you standardize your data before dividing it into k folds: in each trial, the training folds will already know something about the held-out fold. To avoid data leakage, first separate your training/testing sets or cross-validation folds and then standardize. The scikit-learn `Pipeline` tool makes this process easy: for each split, it fits the preprocessing steps on the training data only and then applies the fitted transformation to the testing data.
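The split-then-standardize order can be written out by hand to show what a pipeline automates. This is a minimal sketch, assuming the digits dataset and an arbitrary 80/20 split.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_digits(return_X_y=True)

# Split first, so the test rows never influence the scaling statistics
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

scaler = StandardScaler()
# fit_transform learns the mean and standard deviation from the training rows only
X_train_std = scaler.fit_transform(X_train)
# transform reuses those training statistics on the test rows
X_test_std = scaler.transform(X_test)

# Training columns are centered at zero; test columns are only close to zero,
# because their own mean was never used
print('Largest training column mean: ', np.abs(X_train_std.mean(axis=0)).max())
print('Largest test column mean: ', np.abs(X_test_std.mean(axis=0)).max())
```

The test set being slightly off-center is expected: it confirms that no test statistics were used during scaling.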

In the next section, we will assemble a pipeline and then fit our model, using k-fold cross-validation to determine the accuracy.

## Follow Along

In [1]:
# Import libraries
import numpy as np
from sklearn import datasets
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

In [2]:
# Load the digits data

# The default with 10 classes (digits 0-9)
digits = datasets.load_digits(n_class=10)

# Create the feature matrix
features = digits.data
print('The shape of the feature matrix: ', features.shape)

# Create the target array
target = digits.target
print('The shape of the target array: ', target.shape)
print('The unique classes in the target: ', np.unique(target))

The shape of the feature matrix:  (1797, 64)
The shape of the target array:  (1797,)
The unique classes in the target:  [0 1 2 3 4 5 6 7 8 9]


In [3]:
# Instantiate the standardizer
standardizer = StandardScaler()

# Instantiate the classifier
logreg = LogisticRegression(max_iter=150)

# Create the pipeline
pipeline = make_pipeline(standardizer, logreg)

# Instantiate the k-fold cross-validation splitter
kfold_cv = KFold(n_splits=5, shuffle=True, random_state=11)

In [4]:
# Fit the model using k-fold cross-validation
cv_scores = cross_val_score(pipeline, features, target,
                           cv=kfold_cv, scoring='accuracy')

In [5]:
# Print all of the scores
print('All cv scores: ', cv_scores)

# Print the mean score
print('Mean of all cv scores: ', cv_scores.mean())

All cv scores:  [0.97222222 0.96944444 0.95543175 0.97493036 0.98050139]
Mean of all cv scores:  0.9705060352831941


Above, we displayed the accuracy score for each of the five folds along with the mean of those scores, which is about 0.97.
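To make explicit what happens to the pipeline inside each fold, the same evaluation can be written as a manual loop. This is a sketch that mirrors the settings used above; the variable names `manual_scores` and `library_scores` are introduced here for illustration, and the comparison assumes both approaches see the same folds, which holds because `KFold` is given a fixed `random_state`.

```python
import numpy as np
from sklearn import datasets
from sklearn.base import clone
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

digits = datasets.load_digits()
features, target = digits.data, digits.target

pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=150))
kfold_cv = KFold(n_splits=5, shuffle=True, random_state=11)

manual_scores = []
for train_idx, test_idx in kfold_cv.split(features):
    # Start each fold from a fresh, unfitted copy of the pipeline
    fold_model = clone(pipeline)
    # fit() learns the scaling statistics and model weights from the training folds only
    fold_model.fit(features[train_idx], target[train_idx])
    # score() scales the held-out fold with the training statistics, then returns accuracy
    manual_scores.append(fold_model.score(features[test_idx], target[test_idx]))
manual_scores = np.array(manual_scores)

# The library helper, run on the same folds for comparison
library_scores = cross_val_score(pipeline, features, target,
                                 cv=kfold_cv, scoring='accuracy')

print('Manual scores:  ', manual_scores)
print('Library scores: ', library_scores)
print('Standard deviation across folds: ', manual_scores.std())
```

Reporting the standard deviation alongside the mean gives a sense of how much the score depends on which fold is held out.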

## Challenge

Look at the number of folds in the k-fold cross-validation: how does changing the value affect the accuracy of the model?

## Additional Resources

* [Scikit-learn: Standard Scaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html)
* [Scikit-learn: Logistic Regression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html)
* [Scikit-learn: Digits Dataset](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_digits.html)
* [Scikit-learn: KFold](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html)
* [Evaluate a score by cross-validation](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html#sklearn.model_selection.cross_val_score)