<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Train Test Split
_Author: Kevin Coyle_

### LEARNING OBJECTIVES
*After this lesson, you will be able to:*
- Describe what training and testing datasets are.
- Explain the motivation behind creating a "testing" set when training our models.
- Create train/test splits using scikit-learn.

## Why Split?

Splitting things is not something most people like to do. Most people want to connect! 

SQL has language that does `joins`; it doesn't have language that does `break up then ghost some data`.

Getting invited to that kid's birthday party in 2nd grade was awesome. He had an arcade game IN HIS HOUSE.

Not getting invited to that other kid's party in 3rd grade was a bummer. I thought we all connected so well!

Connecting and joining things seems to be natural in our everyday lives and in our data lives.

Why would we want to split our data then? Is this the cold, rational side of science?

No!

Recall that a major goal of creating a model is for the model to _generalize_ well to unseen data.

However, we do not have unseen data. We just have data from the past. How would we emulate this idea of "data my model hasn't seen" if we only have past, historical data?

### Enter the split.

Common convention is to use an uppercase `X` for our features and a lowercase `y` for our target.

Train test split is no different.

Here are the four variables you'll use for training + testing sets:

In [1]:
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split

In [2]:
data = [[1,2,3],
       [4,5,6],
       [7,8,9],
       [10,11,12],
       [13,14,15],
       [16,17,18],
       [19,20,21],
       [22,23,24],
       [25,26,27],
       [28,29,30]]

# Name the three columns with a plain list of labels
df = pd.DataFrame(data, columns=['A', 'B', 'C'])
df

Unnamed: 0,A,B,C
0,1,2,3
1,4,5,6
2,7,8,9
3,10,11,12
4,13,14,15
5,16,17,18
6,19,20,21
7,22,23,24
8,25,26,27
9,28,29,30


In [3]:
feature_cols = ['B', 'C']

X = df[feature_cols]
y = df.A

In [4]:
# train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

We can control how big our testing set is (and, consequently, our training set) by changing `test_size`. It takes a float between 0 and 1, which is the fraction of the total data held out for testing.
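To see `test_size` in action, here is a minimal sketch on a small made-up array (the 10-row `X_demo` and `y_demo` are stand-ins invented for this illustration, not part of the lesson's data). It only checks how many rows land in each piece.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 rows, 2 feature columns
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

# Hold out 20% of the rows for testing
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

print(X_tr.shape, X_te.shape)  # (8, 2) (2, 2)
print(len(y_tr), len(y_te))    # 8 2
```

Whatever is not held out for testing ends up in the training set, so the two pieces always add up to the full dataset.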

## Pop Quiz!!

### Given the following number, how big (as a percentage) is our **testing set** (assuming we're setting this number to the test_size)? 

In [None]:
# .3

### Given the following number, how big (as a percentage) is our **training set**  (assuming we're setting this number to the test_size)?

In [None]:
# .6

### Assume I have a dataset of 100 rows. If I set the `test_size` to the following amount, how many rows are in my **test** set?

Great! Additionally, it's a good move to set a `random_state`.

Which number you choose here is inconsequential. It has nothing to do with the number of rows in your dataset, and it has nothing to do with any calculation.

Most people choose 42 because 42 is the answer to everything.

Random state ensures that each time you create your train and test sets, the same rows are chosen!
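A quick way to convince yourself: split the same data twice with the same `random_state` and compare the results. This is a small sketch on made-up arrays (`X_demo` and `y_demo` are hypothetical stand-ins), not the lesson's dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 rows, 2 feature columns
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

# Two separate calls with the same random_state
X_train_a, X_test_a, y_train_a, y_test_a = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42
)
X_train_b, X_test_b, y_train_b, y_test_b = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42
)

# The same rows are picked both times
same_split = np.array_equal(X_test_a, X_test_b) and np.array_equal(y_test_a, y_test_b)
print(same_split)  # True
```

If you leave `random_state` out, the rows are shuffled differently on each run, so the split is generally not repeatable.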

Let's pull all these together by reading in data and checking out the rows selected for the training set, even if we re-run the cell multiple times!

Read in our NCAA data from the "../data" folder (pandas is already imported above).

In [5]:
ncaa_df = pd.read_csv('../data/ncaa.csv')

In [6]:
ncaa_df.head()

Unnamed: 0,AST,AST_Diff,BLK,BLK_Diff,Coach,DR,DR_Diff,FGP,FGP3,FGP3_Diff,...,PPG_Diff,Rank,Rank_Diff,Result,SEED_Diff,STL,STL_Diff,Season,Seed,TeamID
0,14.0,4.666667,4.176471,0.676471,mark_gottfried,26.411765,3.911765,0.444393,0.347418,0.04075,...,18.539216,3.0,35,1,9.0,7.235294,1.401961,2003,10.0,1104
1,15.380952,6.047619,4.095238,0.595238,rick_stansbury,26.380952,3.880952,0.495357,0.36198,0.055311,...,17.5,3.0,21,1,4.0,9.285714,3.452381,2003,5.0,1280
2,13.4,4.066667,5.75,2.25,eddie_sutton,24.5,2.0,0.474798,0.382793,0.076124,...,17.233333,3.0,19,1,5.0,9.65,3.816667,2003,6.0,1329
3,14.590909,5.257576,3.818182,0.318182,rick_barnes,26.636364,4.136364,0.456127,0.343126,0.036458,...,22.060606,3.0,1,1,0.0,6.954545,1.121212,2003,1.0,1400
4,14.590909,5.257576,3.818182,0.318182,rick_barnes,26.636364,4.136364,0.456127,0.343126,0.036458,...,22.060606,3.0,1,1,0.0,6.954545,1.121212,2003,1.0,1400


scikit-learn has built this `train_test_split` functionality out for you. It lives in the scikit-learn library, under the `model_selection` module.

Import it now.

In [7]:
from sklearn.model_selection import train_test_split

Okay! Create your X_train, X_test, y_train, and y_test now

In [8]:
feature_cols = ['AST', 'FGP', 'DR']

X = ncaa_df[feature_cols]
y = ncaa_df.Rank

In [9]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

Try that again, but this time make your training set 25% of your total DataFrame, and set your `random_state` to 123.

In [10]:
# train_size=0.25 keeps 25% of the rows for training; the rest go to the test set
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.25, random_state=123)

# Great job! 
You've successfully created a training set and a testing set.
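So what do you do with the two sets? Here is a rough sketch of the idea from the start of the lesson: fit on the training set, then score on the testing set to stand in for "data my model hasn't seen." The synthetic data and the choice of `LinearRegression` are assumptions made for illustration only; they are not the NCAA data or a model this lesson prescribes.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Hypothetical synthetic data: y is roughly 3*x + 2 plus some noise
rng = np.random.default_rng(0)
X_syn = rng.uniform(0, 10, size=(100, 1))
y_syn = 3 * X_syn[:, 0] + 2 + rng.normal(0, 1, size=100)

# Hold out 25% of the rows as "unseen" data
X_train_s, X_test_s, y_train_s, y_test_s = train_test_split(
    X_syn, y_syn, test_size=0.25, random_state=42
)

# The model only ever sees the training rows while fitting
model = LinearRegression()
model.fit(X_train_s, y_train_s)

# R^2 on the rows used for fitting vs. the held-out rows
train_r2 = model.score(X_train_s, y_train_s)
test_r2 = model.score(X_test_s, y_test_s)
print(train_r2, test_r2)
```

The test score is the one to watch: it estimates how well the model generalizes, because none of those rows were used during fitting.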