# Model Validation

We've done a lot of work with our linear models this week, but what is this all for... and will it even work?

![I Love Lucy shrug gif from Giphy](https://media.giphy.com/media/JRhS6WoswF8FxE0g2R/giphy.gif)

To answer those questions, let's think through the point of modeling. The diagram below outlines CRISP-DM (the Cross-Industry Standard Process for Data Mining), one version of the data science process and a nice way to break data science down into its component pieces.

Looking at this diagram, what is the end result? What is all of this for?

![CRISP-DM Process diagram, from stellar consulting](https://www.stellarconsulting.co.nz/wp-content/uploads/2017/08/CRISP-DM_Process_1000x600.jpg)

For most models to be useful, they must be used - on real-world, unseen, potentially real-time data that goes beyond the data we have available when we are training models. But how in the world can we know if a model will work on real-world data?

In other words, how do we know if a model is **_generalizable_**?

## Learning Objectives

- Recognize why validation is important
- Describe how a train-test split works
- Apply a train-test split to a dataset using sklearn
- Explain why k-fold cross validation is often more robust than a single train-test split
- Apply k-fold cross validation to a dataset using sklearn

## Model Validation

Let's say you have a dataframe with some number of rows, and that's all the data available to you. The hope is that you can train a model on this data and then use it to make predictions about new data as it comes in. You want your model to generalize well and work on this incoming data. How can you check that it does?

### Train-Test Split

The idea: don't train your model on ALL of your data. Instead, hold some of it in reserve as a test set. Evaluating the model on rows it never saw during training simulates how it will perform on new, incoming data.

#### Example:

![original image from https://www.dataquest.io/wp-content/uploads/kaggle_train_test_split.svg plus some added commentary](images/traintestsplit_80-20.png)

Note: in this image, it looks like we're just taking the tail end of the dataset and setting it aside. In practice, rows are usually assigned to the train and test sets at random.
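To make the difference between a tail-end split and a random split concrete, here is a minimal sketch using sklearn's `train_test_split`. The toy arrays `X` and `y`, the 80/20 ratio, and the `random_state` value are assumptions chosen purely for illustration; they are not part of the lesson's dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 rows, 2 feature columns, and 1 target value per row
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# Tail-end split, like the image: with shuffle=False the last 20% of rows
# become the test set
X_train_tail, X_test_tail, y_train_tail, y_test_tail = train_test_split(
    X, y, test_size=0.2, shuffle=False
)

# Random split (the default behavior): rows are shuffled before splitting.
# random_state is fixed only so the result is reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print("Tail-end test targets:", y_test_tail)
print("Random test targets:  ", y_test)
print("Train/test sizes:     ", X_train.shape, X_test.shape)
```

Either way, 8 rows land in the training set and 2 in the test set. The only difference is which rows are held out.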

#### Practice:

Documentation: [`train_test_split`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html)