# Cross-Validation

### Introduction

So far, when training and scoring our model, we have used a single validation set to try out different versions of the model.  In this lesson, we'll see how cross-validation allows us to assess our model on multiple validation sets.

### Loading the Data

In [1]:
import json

file = './dtypes_listings.json'
with open(file, 'r') as f:
    dtypes = json.load(f)

In [30]:
import pandas as pd

df = pd.read_csv('./listings_target.csv', index_col = 0, dtype = dtypes)

### Showing the problem

This time, let's assume that we only have 700 observations in our dataset.

In [40]:
df_lim = df[:700]

In [41]:
X = df_lim.drop(columns = ['price', 'log_price']) 
y = df_lim['log_price']

Let's start by splitting our data.

In [42]:
from sklearn.model_selection import train_test_split
X_train, X_validate, y_train, y_validate = train_test_split(X, y, random_state = 1)

And then let's fit our model.

In [44]:
from sklearn.linear_model import LinearRegression

model = LinearRegression()
model.fit(X_train, y_train).score(X_validate, y_validate)

0.1827507224880086

We can see that our model's performance decreased drastically with our smaller dataset (from around .50 down to about .18), which is likely due to error from variance.  But perhaps the score is this low because the model just happened to perform poorly on this particular validation split.  After all, the validation set is only 175 observations.

In [80]:
X_validate.shape

(175, 321)

If we increase the size of the validation set drastically then we will be limiting our training data.  So ideally we would like to be able to assess our model on a larger validation set without decreasing the amount of data our model trains on.
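
To get a feel for how much a single split can sway the score, here is a minimal sketch that scores the same model on ten different random splits. It uses synthetic data as a stand-in for the listings data, so the names `X_demo`, `y_demo`, and `split_scores`, as well as the noise level, are assumptions made for illustration only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the listings data (an assumption, not the real dataset):
# 700 rows, 20 features, and a noisy linear target.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(700, 20))
y_demo = X_demo @ rng.normal(size=20) + rng.normal(scale=5.0, size=700)

# Score the same model on ten different random train/validation splits.
split_scores = []
for seed in range(10):
    X_tr, X_val, y_tr, y_val = train_test_split(X_demo, y_demo, random_state=seed)
    split_scores.append(LinearRegression().fit(X_tr, y_tr).score(X_val, y_val))

print(f"validation rows: {X_val.shape[0]}")
print(f"min R^2: {min(split_scores):.3f}, max R^2: {max(split_scores):.3f}")
```

The gap between the minimum and maximum score gives a rough sense of how much one particular split can flatter or penalize a model.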

### Introducing K-Fold Cross-Validation

Cross-validation aims to help resolve this dilemma.  Here's how it works:  

1. Instead of splitting our data into one training set and one holdout set, we split the data into several divisions, called folds.

<img src="./folds.png" width="50%">

So in the diagram above, we take our training data and split it into five folds.

2.   We'll then train our model on four of the folds and evaluate the model on the remaining fold.  

<img src="./train-3.png" width="50%">

> So above, Fold 1 is our validation set, and we train on the other four folds.

3. Then we retrain the model, this time assigning Fold 2 as our holdout set.

<img src="./cross-val-2.png" width="50%">

>  And we train on folds 1, 3, 4, and 5. 

4. And so on:

<img src="./folds-multiple.png" width="40%">

This is called k-fold cross-validation.  We split our data into k folds (five in the diagram above) and train our model k times.  Each time, we assign a different fold to be the holdout set.
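
The fold rotation described above can be sketched by hand with NumPy. This is only an illustration of the idea, not how sklearn implements it; the helper name `k_fold_indices` is hypothetical.

```python
import numpy as np

def k_fold_indices(n_samples, k):
    """Yield (train_idx, holdout_idx) pairs for k contiguous folds."""
    # Split the row positions 0..n_samples-1 into k roughly equal folds.
    folds = np.array_split(np.arange(n_samples), k)
    for i in range(k):
        holdout_idx = folds[i]
        # Train on every fold except the current holdout fold.
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, holdout_idx

# With 10 rows and 5 folds, each fold holds out 2 rows.
for train_idx, holdout_idx in k_fold_indices(n_samples=10, k=5):
    print("train:", train_idx, "holdout:", holdout_idx)
```

Each row lands in the holdout set exactly once, so across the k rounds the model is evaluated on all of the data while still training on most of it each time.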

### Cross-Validation in Sklearn

There are multiple tools in sklearn that provide cross-validation.  Let's use the KFold class. 

In [69]:
from sklearn.model_selection import KFold

kf = KFold(n_splits = 5)

Calling the split method provides us with a set of splits of our data.

In [71]:
splits = list(kf.split(X))

> Uncomment and run the cell below.  Notice that we get a tuple of index arrays.  The first element holds the indices of the training data, starting at index 140 and proceeding through the end of our dataset.  The second element holds the indices of the holdout set for that iteration, going from 0 to 139.

In [74]:
#  splits[0]

Then we have the next split.

In [76]:
# splits[1]
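
To make the shape of these splits easier to see, here is a small sketch on a toy array of 10 rows rather than the listings data. The names `X_small`, `kf_small`, and `small_splits` are made up for this example.

```python
import numpy as np
from sklearn.model_selection import KFold

# Toy data (an assumption for illustration): 10 rows, 2 features.
X_small = np.arange(20).reshape(10, 2)

kf_small = KFold(n_splits=5)
small_splits = list(kf_small.split(X_small))

# Each split is a (train_indices, test_indices) tuple of positional indices.
train_idx, test_idx = small_splits[0]
print(train_idx)  # [2 3 4 5 6 7 8 9]
print(test_idx)   # [0 1]
```

Because KFold does not shuffle by default, the holdout folds are contiguous blocks of rows taken in order.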

So we can train our model by using the indices provided by our split method to select our training and holdout sets.

In [79]:
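# Fit on each training fold and score on the matching holdout fold.
# KFold yields positional indices, so we select rows with .iloc for both X and y.
scores = [LinearRegression().fit(X.iloc[train_set, :], y.iloc[train_set]).score(X.iloc[test_set, :], y.iloc[test_set])
 for train_set, test_set in kf.split(X)]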

scores

[0.13511927670096302,
 -104.13482706793098,
 -9.52308662039414,
 -0.7555052312466928,
 -0.13441471180539621]

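So here we can see that, across different splits of the data, this model did not perform well on average.  Four of the five scores are negative, which means that on those folds the model did worse than simply predicting the mean of the holdout targets.  While disappointing, this is valuable information.  It will allow us to move on from this version of the model and choose a different version.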
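
The same fit-and-score loop can also be written with sklearn's `cross_val_score` helper, which for a regressor defaults to the estimator's own `score` method (R^2). The sketch below runs on synthetic data rather than the listings data, so `X_demo`, `y_demo`, `kf_demo`, and `cv_scores` are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for the listings data (an assumption, not the real dataset).
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(700, 20))
y_demo = X_demo @ rng.normal(size=20) + rng.normal(scale=5.0, size=700)

# Passing a KFold object as cv reuses the same fold scheme as the manual loop.
kf_demo = KFold(n_splits=5)
cv_scores = cross_val_score(LinearRegression(), X_demo, y_demo, cv=kf_demo)

print(cv_scores.round(3))
print(f"mean R^2: {cv_scores.mean():.3f}")
```

Summarizing the fold scores with their mean gives a single number for comparing model versions, though it is worth looking at the individual scores too, since one extreme fold can dominate the average.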