# The Test Train Split

### Introduction

In previous lessons we have seen that we can train a decision tree to discover how to split our data according to features that can predict the target of an observation.  For example, if we look at the decision tree for customer leads, we can see that our tree perfectly segmented our dataset between customers and non-customers.

<img src="DTreeViz_customers.svg" width="50%">

### Yea, But Can We Trust It?

So we saw how to train a decision tree in Sklearn.  Can we really rely on this decision tree to predict future observations?  After all, it's really not impressive that our decision tree can predict our training data.  Our training data consists of past data that we already gathered.  We *know* how each observation in our training data turned out.  We don't need a model to tell us.  

Our hope is that by finding what distinguished different outcomes in our training data, we can find a hypothesis function that predicts how future data will turn out.  Can we rely on our model to do this?  Unfortunately, **we cannot**.

In this lesson, we'll learn about why we cannot rely on our model to predict future data, and then in the following lesson we'll see how to correct for this.

## Loading our data

For this lesson, let's use one of the datasets available to us in sklearn -- we'll use the diabetes dataset, which is located in the `sklearn.datasets` module.

> Press `shift + return` to load the data.

In [3]:
from sklearn.datasets import load_diabetes
dataset = load_diabetes()

The `load_diabetes` function returns a dictionary-like object.

In [4]:
dataset.keys()

dict_keys(['data', 'target', 'DESCR', 'feature_names', 'data_filename', 'target_filename'])

Our feature data is listed under the key of `data`, and our target data is listed under `target`.  The `feature_names` key tells us the name of each feature.  We can turn this into a pandas dataframe with the following code.

> Don't worry, we'll talk all about pandas and dataframes in following lessons.

First we load up our target values.

In [4]:
import pandas as pd
y = pd.Series(dataset['target'])
y[:3]

0    151.0
1     75.0
2    141.0
dtype: float64

> The numbers in the target above represent the quantitative measure of disease progression one year after baseline.  The larger the numbers, the more the progression.

In [8]:
X = pd.DataFrame(dataset['data'], columns = dataset['feature_names'])
X[:2]

|   | age | sex | bmi | bp | s1 | s2 | s3 | s4 | s5 | s6 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.038076 | 0.05068 | 0.061696 | 0.021872 | -0.044223 | -0.034821 | -0.043401 | -0.002592 | 0.019908 | -0.017646 |
| 1 | -0.001882 | -0.044642 | -0.051474 | -0.026328 | -0.008449 | -0.019163 | 0.074412 | -0.039493 | -0.06833 | -0.092204 |


The columns `s1` through `s6` represent the six blood serum measurements recorded from the patients in the study.  Each of the features above will be used to predict the disease progression numbers in our target.

> If you'd like to learn more about the dataset, uncomment and then run the line below.

In [32]:
# print(dataset['DESCR'][:900])

Ok, now let's select all but the last 20 rows of our dataset to train the model (we'll explain why later).

In [9]:
X.shape

(442, 10)

In [24]:
X_train = X[:-20]

In [25]:
y_train = y[:-20]

Let's store these last twenty values in what we'll call our holdout data.

In [17]:
X_holdout = X[-20:]
y_holdout = y[-20:]

### Training and Evaluating our model

In [26]:
from sklearn.tree import DecisionTreeRegressor
model = DecisionTreeRegressor()
model.fit(X_train, y_train)

DecisionTreeRegressor(criterion='mse', max_depth=None, max_features=None,
                      max_leaf_nodes=None, min_impurity_decrease=0.0,
                      min_impurity_split=None, min_samples_leaf=1,
                      min_samples_split=2, min_weight_fraction_leaf=0.0,
                      presort=False, random_state=None, splitter='best')

We can see how well our model predicts different data with the `score` method.  The `score` method works the way we would expect - it feeds each observation's features through the model, and compares what the model predicts to the actual observed value.  A `1` is the highest possible score, and a `0` means that our model does no better than if we simply predicted the mean every time.  Let's see how well our model predicts the training data.

> To learn more about the score that Sklearn uses for regression problems, see the [Coefficient of Determination](https://en.wikipedia.org/wiki/Coefficient_of_determination).
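
As a rough sketch of what that score measures, the coefficient of determination can be computed by hand: one minus the ratio of the model's squared error to the squared error of always predicting the mean.  The helper name `r_squared` and the small arrays below are made up for illustration; `r2_score` is sklearn's own function for the same metric.

```python
import numpy as np
from sklearn.metrics import r2_score

def r_squared(y_true, y_pred):
    """Hypothetical helper: 1 - (model's squared error / mean-only squared error)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # error of the predictions
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # error of predicting the mean
    return 1 - ss_res / ss_tot

y_true = [3, 5, 7, 9]  # made-up observed values, mean is 6

perfect = r_squared(y_true, [3, 5, 7, 9])    # every prediction is exact
mean_only = r_squared(y_true, [6, 6, 6, 6])  # always predict the mean
close = r_squared(y_true, [4, 5, 6, 10])     # small errors, about 0.85

print(perfect, mean_only, close)
print(r2_score(y_true, [4, 5, 6, 10]))       # sklearn's value for the same inputs
```

Predictions that are worse than the mean make the ratio larger than one, which is how the score can drop below zero.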

In [27]:
model.score(X_train, y_train)

1.0

Hot damn!  A perfect score.  A perfect score signifies that our model predicted every observation in the training data perfectly.  And looking at the outputs of the `predict` method below, it looks like it does.  The values predicted by our model and the actual values are the same.

In [28]:
model.predict(X_train[:10])

array([151.,  75., 141., 206., 135.,  97., 138.,  63., 110., 310.])

In [16]:
list(y[:10])

[151.0, 75.0, 141.0, 206.0, 135.0, 97.0, 138.0, 63.0, 110.0, 310.0]

Now let's see how our model does on data it did not see -- the twenty values that we held back from our training set.

In [29]:
model.score(X_holdout, y_holdout)

-0.07654269368019251

Wow.  Ok, so on data our model did not see in training, the predictions are not very good.  

For example, for the last ten observations in the holdout set, the model predicts:

In [30]:
model.predict(X_holdout[10:])

array([268.,  51.,  96.,  88.,  72., 273., 111.,  77.,  99.,  96.])

And the actual values are:

In [31]:
list(y_holdout[10:])

[173.0, 72.0, 49.0, 64.0, 48.0, 178.0, 104.0, 132.0, 220.0, 57.0]

A negative score signifies that our model did worse than if we simply predicted the mean for each observation.

In [32]:
y_train.mean()

153.36255924170615
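
To make the "predict the mean" baseline concrete, here is a small sketch that scores two constant predictions on the holdout targets.  It assumes the same last-20-rows holdout as above; the variable names are illustrative.  One detail worth knowing is that the score's zero point is the mean of the data being scored, so predicting the training mean on the holdout set lands at or slightly below zero.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.metrics import r2_score

dataset = load_diabetes()
y = dataset['target']
y_train, y_holdout = y[:-20], y[-20:]  # same holdout as in the lesson

# Constant predictions: one value repeated for every holdout observation
predict_train_mean = np.full(len(y_holdout), y_train.mean())
predict_holdout_mean = np.full(len(y_holdout), y_holdout.mean())

score_train_mean = r2_score(y_holdout, predict_train_mean)
score_holdout_mean = r2_score(y_holdout, predict_holdout_mean)

print(score_train_mean)    # at or below zero
print(score_holdout_mean)  # zero, up to floating point rounding
```

A model that scores below these baselines on the holdout set is adding error rather than removing it.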

What's going on?  Why did our decision tree model perfectly predict the data that it trained on, yet do so poorly on data it did not see?

### An observation in every pot

We can see what's going on if we take a look at our regression tree.  Below is an image of just some of the leaves (that is, the last layer) of the regression tree where the predictions are made.

<img src="https://storage.cloud.google.com/curriculum-assets/intro-to-ml/leaf-nodes.png">

The bottom line of charts displays our leaf nodes.  The $n = 1$ means that there is one observation in essentially every node.  So this means that our decision tree is finding a set of if-else statements that makes a separate prediction for each observation.
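
We can check this without the picture.  The sketch below refits the same kind of tree and counts how many training rows land in each leaf, using the tree's `apply` method, which returns the leaf that each row falls into.  The `random_state=0` argument is only there to make repeated runs match, and the variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.tree import DecisionTreeRegressor

dataset = load_diabetes()
X_train = dataset['data'][:-20]
y_train = dataset['target'][:-20]

model = DecisionTreeRegressor(random_state=0)  # fixed seed for repeatability
model.fit(X_train, y_train)

leaf_ids = model.apply(X_train)          # leaf node index for each training row
leaf_sizes = np.bincount(leaf_ids)       # number of rows per node index
leaf_sizes = leaf_sizes[leaf_sizes > 0]  # keep only nodes that are leaves

print(len(X_train), 'training rows')
print(len(leaf_sizes), 'leaves')
print((leaf_sizes == 1).sum(), 'leaves holding a single observation')
```

If the number of leaves is close to the number of training rows, the tree has carved out a separate rule for nearly every observation.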

What's the problem with that?  The problem is that there's randomness in every dataset.  And if our decision tree just consists of rules that happen to match the past data, but do not *generalize* to future data, then the model is not useful to us.  

> It's like a person who makes predictions on football games by saying: that one time when it was 20 degrees outside on a Thursday, the home team won, so the next time it's 20 degrees on a Thursday the home team will win.  Because this prediction is so specific, we do not know if it will generalize to future data.

So the way that we really evaluate our machine learning algorithm is to hold back some of our data from training, and then see how well our machine learning model performs on data it's never seen before.  

This is called our test train split.  That is, we split our data, using some of it for training our model and the rest for testing to see how well our model does on data it did not see.  We evaluate on the test set to get a sense of how well our model will perform on future data.

What percentage of data should be used in training, and what percentage in testing?  Well, holding back 20 percent for testing is an ok rule of thumb to start with.  Let's try it.

### Implementing our Test Train Split

We can use the `shape` attribute to see that we have 442 observations in our dataset.

In [67]:
X.shape

(442, 10)

Let's hold back 20% of the data.

In [68]:
.2*442

88.4

Ok, so now we assign all but the last 88 observations to our training set, and the remainder to our test set.

In [69]:
X_train = X[:-88]
y_train = y[:-88]

X_test = X[-88:]
y_test = y[-88:]
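
Sklearn also ships a helper for this, `train_test_split` in `sklearn.model_selection`.  As a sketch, the call below reproduces the slicing above: passing `test_size=88` asks for exactly 88 test rows, and `shuffle=False` keeps the rows in their original order so the test set is the last 88 rows.  By default the function shuffles the rows before splitting, so leaving out `shuffle=False` would give a different (random) split.

```python
import pandas as pd
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split

dataset = load_diabetes()
X = pd.DataFrame(dataset['data'], columns=dataset['feature_names'])
y = pd.Series(dataset['target'])

# Hold back the last 88 rows for testing, without shuffling
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=88, shuffle=False
)

print(X_train.shape, X_test.shape)
```
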

In [71]:
from sklearn.tree import DecisionTreeRegressor
model = DecisionTreeRegressor()
model.fit(X_train, y_train)

DecisionTreeRegressor(criterion='mse', max_depth=None, max_features=None,
                      max_leaf_nodes=None, min_impurity_decrease=0.0,
                      min_impurity_split=None, min_samples_leaf=1,
                      min_samples_split=2, min_weight_fraction_leaf=0.0,
                      presort=False, random_state=None, splitter='best')

In [72]:
model.score(X_test, y_test)

-0.24113290219476746

Well, once again, our decision tree does not do a good job of predicting data it has not yet seen.  It's for the same reason - it's just finding rules to match every observation it has seen, but to do so, it comes up with logic that does not repeat in new data.  In other words, it is not separating the signal from the noise.

But don't get discouraged.  There is a solution of course.  We'll see it in the next lesson.

### Summary

In this lesson, we saw that a machine learning model's ability to accurately predict the outcomes of training data does not necessarily mean that it can accurately predict the outcomes of data it was not trained on.

One way to assess this is to split our data into a training dataset and a test dataset.  We train our model on only a portion of our data, and then after our model is trained, we can see how it performs on data it was not trained on.

An easy way to assess this is with the `model.score()` method.  This method takes arguments of the features and observed targets, and sees how well the predictions match what was observed. 