# Model Validation

We've done a lot of work with our linear models this week, but what is this all for...? And will it even work...?

![I Love Lucy shrug gif from Giphy](https://media.giphy.com/media/JRhS6WoswF8FxE0g2R/giphy.gif)

To answer those questions, let's think through the point of modeling. The diagram below outlines the CRISP-DM version of the data science process, a useful way to break data science down into its component pieces.

Looking at this diagram, what is the end result? What is all of this for?

![CRISP-DM Process diagram, from stellar consulting](https://www.stellarconsulting.co.nz/wp-content/uploads/2017/08/CRISP-DM_Process_1000x600.jpg)

For most models to be useful, they must be used - on real-world, unseen, potentially real-time data that goes beyond the data we have available when we are training models. But how in the world can we know if a model will work on real-world data?

In other words, how do we know if a model is **_generalizable_**?

## Learning Objectives

- Recognize why validation is important
- Describe how a train-test split works
- Apply a train-test split to a dataset using sklearn
- Explain why k-fold cross validation is often more robust than a single train-test split
- Apply k-fold cross validation to a dataset using sklearn

## Model Validation

Let's say you have a dataframe, with some number of rows of data, and that's all you have available to you. The hope is that you can train a model on this data that can then be used to make predictions about new data that comes in. You want your model to generalize well and work on this incoming data. How can you be sure it does so? 

### Train-Test Split

The idea: don't train your model on ALL of your data, but keep some of it in reserve to test on, in order to simulate how it will work on new/incoming data.

#### Example:

![original image from https://www.dataquest.io/wp-content/uploads/kaggle_train_test_split.svg plus some added commentary](images/traintestsplit_80-20.png)

Note: in this picture, it looks like we're just taking the tail end of the dataset and setting it aside. In practice, the split usually chooses rows at random; sklearn's `train_test_split` shuffles the rows by default before splitting.
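To make the random assignment concrete, here is a minimal sketch of an 80/20 split done by hand with NumPy. The toy arrays (`X_toy`, `y_toy`) and the seed are made up for illustration; this is not how sklearn implements its split internally, just the same idea in a few lines.

```python
import numpy as np

# Toy data: 10 rows, 2 features (made-up values, for illustration only)
rng = np.random.default_rng(13)
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.arange(10)

# Shuffle the row positions, then cut at the 80% mark
shuffled = rng.permutation(len(X_toy))
cutoff = int(len(X_toy) * 0.8)
train_idx, test_idx = shuffled[:cutoff], shuffled[cutoff:]

# Use the same row positions for X and y so they stay aligned
X_toy_train, X_toy_test = X_toy[train_idx], X_toy[test_idx]
y_toy_train, y_toy_test = y_toy[train_idx], y_toy[test_idx]

print(X_toy_train.shape, X_toy_test.shape)  # (8, 2) (2, 2)
```

Every row lands in exactly one of the two sets, and which rows end up in the test set depends only on the shuffle.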

#### Practice:

Documentation: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html

In general - please replace `None` with the appropriate code below!

In [78]:
# Imports
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

In [79]:
df = pd.read_csv('data/kc_house_data.csv', parse_dates=[1])

Very quick data exploration:

In [80]:
df.head()

| | id | date | price | bedrooms | bathrooms | sqft_living | sqft_lot | floors | waterfront | view | ... | grade | sqft_above | sqft_basement | yr_built | yr_renovated | zipcode | lat | long | sqft_living15 | sqft_lot15 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7129300520 | 2014-10-13 | 221900.0 | 3 | 1.0 | 1180 | 5650 | 1.0 | 0 | 0 | ... | 7 | 1180 | 0 | 1955 | 0 | 98178 | 47.5112 | -122.257 | 1340 | 5650 |
| 1 | 6414100192 | 2014-12-09 | 538000.0 | 3 | 2.25 | 2570 | 7242 | 2.0 | 0 | 0 | ... | 7 | 2170 | 400 | 1951 | 1991 | 98125 | 47.721 | -122.319 | 1690 | 7639 |
| 2 | 5631500400 | 2015-02-25 | 180000.0 | 2 | 1.0 | 770 | 10000 | 1.0 | 0 | 0 | ... | 6 | 770 | 0 | 1933 | 0 | 98028 | 47.7379 | -122.233 | 2720 | 8062 |
| 3 | 2487200875 | 2014-12-09 | 604000.0 | 4 | 3.0 | 1960 | 5000 | 1.0 | 0 | 0 | ... | 7 | 1050 | 910 | 1965 | 0 | 98136 | 47.5208 | -122.393 | 1360 | 5000 |
| 4 | 1954400510 | 2015-02-18 | 510000.0 | 3 | 2.0 | 1680 | 8080 | 1.0 | 0 | 0 | ... | 8 | 1680 | 0 | 1987 | 0 | 98074 | 47.6168 | -122.045 | 1800 | 7503 |


In [81]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21613 entries, 0 to 21612
Data columns (total 21 columns):
id               21613 non-null int64
date             21613 non-null datetime64[ns]
price            21613 non-null float64
bedrooms         21613 non-null int64
bathrooms        21613 non-null float64
sqft_living      21613 non-null int64
sqft_lot         21613 non-null int64
floors           21613 non-null float64
waterfront       21613 non-null int64
view             21613 non-null int64
condition        21613 non-null int64
grade            21613 non-null int64
sqft_above       21613 non-null int64
sqft_basement    21613 non-null int64
yr_built         21613 non-null int64
yr_renovated     21613 non-null int64
zipcode          21613 non-null int64
lat              21613 non-null float64
long             21613 non-null float64
sqft_living15    21613 non-null int64
sqft_lot15       21613 non-null int64
dtypes: datetime64[ns](1), float64(5), int64(15)
memory usage: 3.5 MB


Defining our X and y:

In [153]:
X_cols = [c for c in df.columns.to_list() if 'sqft' not in c]

X = df[X_cols].drop(columns=['id', 'date', 'price'])
y = df['price']

In [154]:
# Train test split here!

# X_train, X_test, y_train, y_test = None

X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                    test_size=0.33, 
                                                    random_state=13)

What did that do?

In [155]:
X_train.shape

(14480, 12)

In [156]:
X_test.shape

(7133, 12)

In [157]:
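# The row counts of the two pieces should add up to the full dataset.
# (Adding the DataFrames themselves with `+` would do index-aligned arithmetic, not concatenation.)
len(X_train) + len(X_test) == len(X)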

True
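Beyond the row counts, it can be reassuring to confirm that no row ends up in both sets and that features and target stay aligned. A small sketch on a made-up DataFrame (the `toy` frame and the `*_demo_*` names are hypothetical, chosen so they don't clash with the housing variables above):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 rows with one feature and one target
toy = pd.DataFrame({"feature": range(100), "target": range(100, 200)})
X_demo = toy[["feature"]]
y_demo = toy["target"]

X_demo_train, X_demo_test, y_demo_train, y_demo_test = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=13
)

# No row index should appear in both the train and test sets
overlap = X_demo_train.index.intersection(X_demo_test.index)
print(len(overlap))

# Features and target keep the same row order within each split
print(X_demo_train.index.equals(y_demo_train.index))
print(X_demo_test.index.equals(y_demo_test.index))
```

If the overlap is empty and the indexes match, the test rows are genuinely unseen by anything fit on the training rows.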

Now let's put the split into practice:

In [158]:
# Instantiate an sklearn linear model

# lr = None

lr = LinearRegression()

In [159]:
# Fit your model - ON THE TRAINING DATA!!

lr.fit(X_train, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

In [160]:
# Grab predictions for train and test set

# y_pred_train = None
# y_pred_test = None

y_pred_train = lr.predict(X_train)
y_pred_test = lr.predict(X_test)

In [161]:
# How'd we do?

# print(f"Train R2 Score: {None}")
# print(f"Test R2 Score: {None}")

print(f"Train R2 Score: {r2_score(y_train, y_pred_train)}")
print(f"Test R2 Score: {r2_score(y_test, y_pred_test)}")

Train R2 Score: 0.6526010277090291
Test R2 Score: 0.6462545503615653
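As a reminder of what these numbers measure: R2 compares the model's squared errors to the squared errors of always predicting the mean of y. The sketch below computes it by hand on made-up values (`y_true` and `y_hat` are hypothetical, not taken from the housing data); sklearn's `r2_score` computes the same quantity.

```python
import numpy as np

# Hypothetical actual values and predictions, for illustration only
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.5, 5.5, 7.0, 9.5])

# Sum of squared residuals: how far the predictions are from the actual values
ss_res = np.sum((y_true - y_hat) ** 2)

# Total sum of squares: how far the actual values are from their own mean
ss_tot = np.sum((y_true - y_true.mean()) ** 2)

# R2 = 1 - SS_res / SS_tot
r2_manual = 1 - ss_res / ss_tot
print(round(r2_manual, 4))
```

A score of 1 means perfect predictions, and a score of 0 means the model does no better than predicting the mean every time.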


Okay! Not great, but not terrible. But why is our testing score lower than our training score?

![original image from https://rmartinshort.jimdofree.com/2019/02/17/overfitting-bias-variance-and-leaning-curves/](images/underfit-goodfit-overfit.png)
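One way to see the pattern in the picture is to fit models of increasing flexibility and score each on both train and test data. The sketch below uses made-up noisy data and NumPy polynomial fits rather than the housing model, so treat it as an illustration under those assumptions. The training score can only go up as the polynomial degree grows, while the test score will typically stop improving or fall once the model starts fitting noise.

```python
import numpy as np

# Made-up data: a sine curve plus noise (illustration only)
rng = np.random.default_rng(0)
x = np.sort(rng.uniform(-3, 3, 40))
y = np.sin(x) + rng.normal(0, 0.3, 40)

# Hold out every 4th point as a simple test set
test_mask = np.arange(40) % 4 == 0
x_tr, y_tr = x[~test_mask], y[~test_mask]
x_te, y_te = x[test_mask], y[test_mask]

def r2(y_true, y_pred):
    """R2 = 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1 - ss_res / ss_tot

train_scores, test_scores = [], []
for degree in (1, 3, 9):
    # Fit on the training points only, then score both sets
    coefs = np.polyfit(x_tr, y_tr, degree)
    train_scores.append(r2(y_tr, np.polyval(coefs, x_tr)))
    test_scores.append(r2(y_te, np.polyval(coefs, x_te)))
    print(f"degree {degree}: train R2 = {train_scores[-1]:.3f}, "
          f"test R2 = {test_scores[-1]:.3f}")
```

The gap between the two columns is the thing to watch: a model that scores much better on train than on test has learned details of the training rows that do not carry over to new data.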