<img src="./imgs/GA-logo.png" style="float: left; margin: 20px; height: 55px">

# Cross-Validation Lesson

_Authors: Dave Yerrington (SF), Joseph Nelson (DC), Kiefer Katovich (SF), Riley Dallas (AUS), Adi Bronshtein (Live Online), Jeff Hale (Live Online)_

---

### Learning Objectives
- **Describe** train/test split.
- **Describe** cross-validation.
- **Explain** how these validation techniques differ and why we want to use them.
- **Split** data into training and testing sets using train/test split, then train, evaluate, and interpret a model.
- **Use** cross-validation to choose your hyperparameters.

## Overfitting and Underfitting

---

![](imgs/overfitting.jpg)


**What's wrong with the first model?**


**What's wrong with the third model?**


**The middle "Goldilocks" model is a good compromise.**
- It approximates the complexity of the true model and does not model random noise in our sample as true relationships.

![](imgs/overfitting2.png)

#### Import libraries


In [None]:
import numpy as np
import pandas as pd
import seaborn as sns


<a id='demo'></a>

#### Load the Data

---

Today's [dataset](http://faculty.marshall.usc.edu/gareth-james/ISL/data.html) (`Advertising.csv`) is from the [ISLR website](http://faculty.marshall.usc.edu/gareth-james/ISL/index.html). 

Drop `Unnamed: 0` once you've loaded the csv into a `DataFrame`.

In [None]:
df = pd.read_csv('http://faculty.marshall.usc.edu/gareth-james/ISL/Advertising.csv')
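As a minimal sketch of the drop step, the cell below reads a tiny stand-in for `Advertising.csv` from a string instead of the URL. The three rows are made up for illustration, and the blank first header field is an assumption about the file layout that makes pandas name the column `Unnamed: 0`.

```python
import io

import pandas as pd

# Hypothetical three-row stand-in for Advertising.csv (values are made up).
# The blank first header field is what pandas names 'Unnamed: 0'.
csv_text = """,TV,radio,newspaper,sales
1,100.0,20.0,30.0,12.0
2,50.0,10.0,40.0,8.0
3,200.0,35.0,5.0,20.0
"""

df = pd.read_csv(io.StringIO(csv_text))

# Drop the leftover row-number column.
df = df.drop(columns='Unnamed: 0')
print(df.columns.tolist())
```

Passing `index_col=0` to `pd.read_csv` is another option: it turns that first column into the index instead of dropping it.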


## Data cleaning
---

1. Do we have any null values?


#### EDA: Plot a Heatmap of the Correlation Matrix
---

Heatmaps are an effective way to visually examine the correlational structure of your predictors. 

#### EDA: Use seaborn's `.pairplot()` method to create scatterplots for each of our columns
---

<a id='x-y'></a>

## Create our features matrix (`X`) and target vector (`y`)
---

The following columns will be our features:
- `TV`
- `radio`
- `newspaper`

The `sales` column is our label: the column we're trying to predict.

Create your `X` and `y` variables.

In [None]:
features = ['TV', 'radio', 'newspaper']
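A minimal sketch of this step, assuming a `DataFrame` with the four columns named above. The data here is randomly generated as a stand-in for the real advertising data.

```python
import numpy as np
import pandas as pd

# Random stand-in for the advertising data (an assumption, not the real values).
rng = np.random.default_rng(0)
df = pd.DataFrame(
    rng.uniform(0, 100, size=(200, 4)),
    columns=['TV', 'radio', 'newspaper', 'sales'],
)

features = ['TV', 'radio', 'newspaper']

X = df[features]   # features matrix: a 2-D DataFrame
y = df['sales']    # target vector: a 1-D Series

print(X.shape, y.shape)
```

Selecting with a list of column names keeps `X` two-dimensional, which is the shape scikit-learn expects for features.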

<a name="train-test-split"></a>
## Train/Test Split

---

What would a good model do?

Predict values for y that are close to the true values for y - WITH DATA IT HASN'T SEEN!

This is called _generalizing to new data_. 

### If your model doesn't generalize well, it's not good.

---

So far we've fit our models with ALL our data, so we haven't tried to generalize to unseen data.

One popular method for evaluating how well our model generalizes is to split our data into two parts. This basic technique is called **train/test split**.

The two parts are:

> **"A Train Set":** The subset of the data on which we fit our model.

> **"A Test Set":** The subset of the data on which we EVALUATE the quality of our model.

<a id='sklearn-tts'></a>

## Scikit-Learn's `train_test_split` function
---

This is tattoo worthy. 💪

#### What's the default test set percentage?

### What do the resulting variables look like?

### Are the rows now randomized?
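The questions above can be explored with a short sketch like the one below. It uses randomly generated stand-in data with 200 rows, and `random_state=42` is an arbitrary choice to make the split repeatable.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Random stand-in data (an assumption, not the real advertising values).
rng = np.random.default_rng(0)
X = pd.DataFrame(
    rng.uniform(0, 100, size=(200, 3)),
    columns=['TV', 'radio', 'newspaper'],
)
y = pd.Series(rng.uniform(0, 30, size=200), name='sales')

# With no test_size given, scikit-learn holds out 25% of the rows.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

print(X_train.shape, X_test.shape)   # 150 training rows, 50 test rows
print(X_train.index[:5].tolist())    # the original row order is shuffled
```

The index of `X_train` still lines up with the index of `y_train`, so each row keeps its own label after shuffling.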

## Linear Regression model
---

Create a `LinearRegression` model

Fit it to your **training data** (`X_train`, `y_train`)

#### Make predictions on the test set. You are using X_test to predict what you think the y values should be.
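Here is a rough sketch of the fit and predict steps. The data is synthetic: `y` is built from a made-up linear relationship plus noise, so the coefficients are illustrative only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data with a made-up linear relationship plus noise.
rng = np.random.default_rng(0)
X = rng.uniform(0, 100, size=(200, 3))
y = 3.0 + 0.05 * X[:, 0] + 0.2 * X[:, 1] + rng.normal(0, 1, size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Instantiate the model, then fit it on the training data only.
lr = LinearRegression()
lr.fit(X_train, y_train)

# Predict y for rows the model has not seen.
y_pred = lr.predict(X_test)
print(y_pred[:5])
```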

### Model Evaluation
---

Now use `.score` or a sklearn scoring function such as `mean_squared_error` to evaluate your model's performance on the test data. How close are your `y` predictions to the ground truth values for `y`?
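Both scoring options could look something like this. The sketch again assumes synthetic data with a made-up linear relationship, so the scores printed here are not the lesson's results.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic data with a made-up linear relationship plus noise.
rng = np.random.default_rng(0)
X = rng.uniform(0, 100, size=(200, 3))
y = 3.0 + 0.05 * X[:, 0] + 0.2 * X[:, 1] + rng.normal(0, 1, size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
lr = LinearRegression().fit(X_train, y_train)
y_pred = lr.predict(X_test)

# For a regressor, .score returns R^2 on the data you pass in.
test_r2 = lr.score(X_test, y_test)

# mean_squared_error compares true values with predictions.
test_mse = mean_squared_error(y_test, y_pred)

print(f"Test R^2: {test_r2:.3f}")
print(f"Test MSE: {test_mse:.3f}")
```

Note that `.score` takes features and true labels, while `mean_squared_error` takes true labels and predictions.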

#### Score on the training set

![](imgs/I-am-something-of.jpg)

You care about the test set score. That's the important thing. 

The training set score gives you a hint as to whether your model is underfitting or overfitting. 


Let's look at the training set score.

It's rare that the MSE is higher for the train set than the test set. Usually the MSE is higher for the test set.

---
<a id='cross-val-k-fold'></a>

# K-Fold Cross-Validation

---

K-fold cross-validation takes the idea of a single train/test split and expands it to *multiple tests* across different train/test splits of your data.

For example, if you determine your training set will contain 80 percent of the data and your testing set will contain the other 20 percent, you could have five different 80/20 splits in which the test set in each is a different set of observations. We have:
- Five (K=5) training sets.
- Five (K=5) corresponding testing sets.

**K-fold cross-validation builds K models — one for each train/test pair — and evaluates those models on each respective test set.**
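To make the K train/test pairs concrete, here is a small sketch using scikit-learn's `KFold` on ten toy row indices. The toy array is an assumption for illustration. Each row lands in exactly one test fold.

```python
import numpy as np
from sklearn.model_selection import KFold

# Ten toy observations, identified by their row number.
X = np.arange(10).reshape(-1, 1)

# K=5: each fold holds out 20% of the rows and trains on the other 80%.
kf = KFold(n_splits=5)
folds = list(kf.split(X))

for i, (train_idx, test_idx) in enumerate(folds, start=1):
    print(f"Fold {i}: train={train_idx.tolist()} test={test_idx.tolist()}")
```

Without shuffling, the test folds are consecutive blocks of rows. Passing `shuffle=True` and a `random_state` to `KFold` randomizes which rows go in each fold.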

### K-Fold Cross-Validation Visually with K = 3

<img src="https://snag.gy/o1lLcw.jpg?convert_to_webp=true" width="500">

---

Cross-validation helps us understand how a model might perform in a variety of cases. 

## K-Folds Cross Validation in `sklearn`
---

Now let's try out k-fold cross-validation. scikit-learn provides a useful function for this: `cross_val_score`.

With a regression problem, `cross_val_score` returns the $R^2$ for each of the orange sections above by default. 


What does each of those values represent?

What might be some nice summary statistics?
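A minimal sketch of `cross_val_score`, assuming synthetic data with a made-up linear relationship. Each value in the returned array is the $R^2$ on one held-out fold, and the mean and standard deviation are two common ways to summarize them.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic data with a made-up linear relationship plus noise.
rng = np.random.default_rng(0)
X = rng.uniform(0, 100, size=(200, 3))
y = 3.0 + 0.05 * X[:, 0] + 0.2 * X[:, 1] + rng.normal(0, 1, size=200)

# cv=5 fits five models, each scored on a different held-out fold.
scores = cross_val_score(LinearRegression(), X, y, cv=5)

print(scores)          # one R^2 per fold
print(scores.mean())   # typical performance across folds
print(scores.std())    # how much the score varies from fold to fold
```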

## TTS with cross validation setup

![](https://media.giphy.com/media/3bbcAXSqdmkqBQ22qw/giphy.gif)

### Steps:
1. Split dataset into test/holdout set and training set using `train_test_split`. Don't touch the holdout/test set until the end. ⚠️
1. Then do cross validation on the remaining training data (80% of the data if you held out 20%), splitting it 5 different ways
1. Look at your validation scores (take the mean)
1. Repeat for other models/hyperparameters
1. Fit the best model on all the training data
1. Score the best model on the holdout/test dataset
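The steps above can be sketched end to end as follows. This is one possible arrangement under stated assumptions: the data is synthetic, `test_size=0.2` and `random_state=42` are arbitrary choices, and the two candidate models (with and without an intercept) are simply examples of "other models/hyperparameters".

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic data with a made-up linear relationship plus noise.
rng = np.random.default_rng(0)
X = rng.uniform(0, 100, size=(200, 3))
y = 10.0 + 0.05 * X[:, 0] + 0.2 * X[:, 1] + rng.normal(0, 1, size=200)

# Step 1: set the holdout/test set aside and don't touch it until the end.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Steps 2-4: cross-validate each candidate on the training data only
# and record its mean validation score.
candidates = {
    'with_intercept': LinearRegression(),
    'no_intercept': LinearRegression(fit_intercept=False),
}
cv_means = {
    name: cross_val_score(model, X_train, y_train, cv=5).mean()
    for name, model in candidates.items()
}
print(cv_means)

# Step 5: fit the best candidate on ALL the training data.
best_name = max(cv_means, key=cv_means.get)
best_model = candidates[best_name].fit(X_train, y_train)

# Step 6: score it once on the holdout/test set.
test_r2 = best_model.score(X_test, y_test)
print(best_name, test_r2)
```

The test set is only used in the last line, so it plays no part in choosing between the candidates.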


### We need to use cross validation first and then use our test/holdout set 
We don't want information from the test set to leak into the training set when choosing our best model. 


### Let's do TTS again and THEN do cross validation on the remaining training set


In [None]:
X_test.head(2)

In [None]:
y_train.head(2)

In [None]:
y_test.head(2)

In [None]:
X_train.shape

In [None]:
y_train.shape

In [None]:
X_test.shape

In [None]:
y_test.shape

##### Instantiate a linear regression model

#### Use `cross_val_score` with the training data

You get back an array with the R2 for each of the validation sets.

Let's take the mean.

#### Let's compare a different model

Let's try a linear regression model without the intercept.

In [None]:

# usually we'll do more interesting hyperparameter tuning

#### How did that model do?

### Fit the best model on ALL the training data

Score the best model on the holdout/test set

#### Now, let's see how our best model does with data it has never seen before (new data)! 🚀

#### How does that compare to the R2 score on the whole training data?

The difference between the scores on the training and test sets isn't large, so our model does not appear to be overfitting to the training data.

## Exercise

### Compare three models to find the best performing model

1. with only TV and Radio as predictors of Sales.
1. with only Radio as a predictor of Sales.
1. with only TV as a predictor of Sales.

### Steps

1. Set up X and y
1. Split the data with `train_test_split` with a random state of 1.
1. Use `cross_val_score` to check which model works best.
1. Fit the best-performing model on all the training data.
1. Score the best-performing model on the hold-out/test dataset.
1. Which model performed best?
1. Use R2 as your evaluation metric.



--- 
## Summary

You will use `train_test_split()` all the time in your machine learning modeling workflow. You create your test/holdout set and put it to the side so it doesn't influence your model in any way.

You will rarely use `cross_val_score()` directly with your training and validation sets. But you will use scikit-learn's `GridSearchCV` for your training and validation sets all the time. It does the same thing as `cross_val_score`, but is nicer to use. You'll learn about that soon. 👍

---
## Check for understanding
- What is the purpose of splitting a dataset into a training set and a test set?
- What evaluation metric score do you care most about - the training set, validation set, or test set?
- What is the advantage of using cross-validation to tune your hyperparameters instead of a single train/test split?