# Overfitting in Decision Trees

### Introduction

In previous lessons we have seen that we can train a decision tree to discover how to split our data according to features that can predict the target of an observation.  For example, if we look at the decision tree for customer leads, we can see that our tree perfectly segmented our dataset between customers and non-customers.

![](DTreeViz_customers.svg)

### Working with future data

One thing we want to emphasize is that being able to predict our training data is not of any use in itself.  After all, our training data is past data for which we have already gathered the features and the labels.  We *know* how each observation in our training data turned out.  We don't need to train a model to tell us.  But what we hope is that by finding what distinguishes different outcomes in our observed training data, these same differences will show up in future data and be predictive of how it turns out.

In this lesson we'll see how training a model to predict future outcomes is a very different problem than training a model that predicts our past training data.

### Loading our data

For this lesson, let's use one of the built-in sklearn datasets.  We'll use the diabetes dataset, which is loaded with a function in the `sklearn.datasets` module.

In [2]:
from sklearn.datasets import load_diabetes
dataset = load_diabetes()

The `load_diabetes` function returns a dictionary-like object with our feature data stored under `data` and our target data stored under `target`.  The `feature_names` key holds the name of each feature.

In [4]:
dataset.keys()

dict_keys(['data', 'target', 'DESCR', 'feature_names', 'data_filename', 'target_filename'])

We can turn this into a pandas dataframe with the following code.  (Don't worry, we'll talk all about pandas and dataframes in following lessons.)

In [5]:
import pandas as pd

In [25]:
y = pd.Series(dataset['target'])
y[:3]

0    151.0
1     75.0
2    141.0
dtype: float64

> The numbers in the target above represent the quantitative measure of disease progression one year after baseline.

In [10]:
X = pd.DataFrame(dataset['data'], columns = dataset['feature_names'])

In [12]:
X[:2]

| | age | sex | bmi | bp | s1 | s2 | s3 | s4 | s5 | s6 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.038076 | 0.05068 | 0.061696 | 0.021872 | -0.044223 | -0.034821 | -0.043401 | -0.002592 | 0.019908 | -0.017646 |
| 1 | -0.001882 | -0.044642 | -0.051474 | -0.026328 | -0.008449 | -0.019163 | 0.074412 | -0.039493 | -0.06833 | -0.092204 |


The columns `s1` through `s6` represent the six blood serum measurements recorded from the patients in the study.  Each of the features above will be used to predict the disease progression numbers in our target.
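
As an aside, newer versions of scikit-learn (0.23 and later, as far as we know) can hand back pandas objects directly, which skips the manual conversion above.  Here is a minimal sketch, assuming such a version is installed.  The names `frame_dataset`, `X_alt`, and `y_alt` are made up for this illustration and are not used elsewhere in the lesson.

```python
from sklearn.datasets import load_diabetes

# as_frame=True asks sklearn to return pandas objects instead of NumPy arrays
frame_dataset = load_diabetes(as_frame=True)

X_alt = frame_dataset.data      # DataFrame holding the ten feature columns
y_alt = frame_dataset.target    # Series holding the disease progression values

print(X_alt.shape)
print(list(X_alt.columns))
```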

> If you'd like to learn more about the dataset, uncomment and run the line below.

In [32]:
# print(dataset['DESCR'][:900])

Ok, now let's select all but the last 10 rows from our dataset and train the model on them (we'll explain why later).

In [35]:
X.shape

(442, 10)

In [65]:
X_except_ten = X[:-10]

In [66]:
y_except_ten = y[:-10]

### Training and Evaluating our model

In [39]:
from sklearn.tree import DecisionTreeRegressor

model = DecisionTreeRegressor()

model.fit(X_except_ten, y_except_ten)

DecisionTreeRegressor(criterion='mse', max_depth=None, max_features=None,
                      max_leaf_nodes=None, min_impurity_decrease=0.0,
                      min_impurity_split=None, min_samples_leaf=1,
                      min_samples_split=2, min_weight_fraction_leaf=0.0,
                      presort=False, random_state=None, splitter='best')

We can see how well our model predicts different data with the `score` method.  The `score` method works the way we would expect: it feeds each observation's features through the model and compares what the model predicts to the actual observed value.  A `1` is the highest score, and a `0` means that our model does no better than if we simply predicted the mean every time.  Let's see how well our model predicts the training data.

> To learn more about how the score is calculated for regression problems, see the [Coefficient of Determination](https://en.wikipedia.org/wiki/Coefficient_of_determination).
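
To make the score less of a black box, here is a small sketch that computes the coefficient of determination by hand.  The function name `r_squared` and the numbers in `actual` are made up for illustration, and this is our own restatement of the formula, not sklearn's source code.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - (residual sum of squares / total sum of squares)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # squared error of our predictions
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # squared error of always predicting the mean
    return 1 - ss_res / ss_tot

actual = [100.0, 150.0, 200.0]   # made-up target values

perfect = r_squared(actual, [100.0, 150.0, 200.0])    # every prediction is exact
mean_only = r_squared(actual, [150.0, 150.0, 150.0])  # always predict the mean
way_off = r_squared(actual, [200.0, 150.0, 100.0])    # worse than predicting the mean

print(perfect, mean_only, way_off)   # 1.0 0.0 -3.0
```

Notice that nothing stops the score from dropping below zero: whenever the predictions are further from the actual values than the mean is, the ratio is larger than one.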

In [40]:
model.score(X_except_ten, y_except_ten)

1.0

Hot damn.  A perfect score.  A perfect score signifies that our model predicted every observation in the training data perfectly.  And it looks like it does.  The values predicted by our model and the actual values are the same.

In [55]:
model.predict(X_except_ten[:10])

array([151.,  75., 141., 206., 135.,  97., 138.,  63., 110., 310.])

In [56]:
y[:10]

0    151.0
1     75.0
2    141.0
3    206.0
4    135.0
5     97.0
6    138.0
7     63.0
8    110.0
9    310.0
dtype: float64

Now let's see how our model does on data it did not see -- the last ten values of our dataset.

In [57]:
model.score(X[-10:], y[-10:])

-0.1744688599873101

Wow.  We can see that on data it has not yet seen, the predictions are not very good.

The model predicts:

In [60]:
model.predict(X[-10:])

array([268.,  52., 113.,  50.,  37., 273.,  97.,  71.,  99.,  59.])

And the actual values are:

In [59]:
y[-10:]

432    173.0
433     72.0
434     49.0
435     64.0
436     48.0
437    178.0
438    104.0
439    132.0
440    220.0
441     57.0
dtype: float64

A negative score signifies that our model did worse than if we simply predicted the mean for each observation.  What's going on?

### An observation in every pot

We can see what's going on if we just take a look at our regression tree.  Below is an image of just some of the leaves of the regression tree.

<img src="./leaf-nodes.png">

The bottom row of charts represents our leaf nodes.  The $n = 1$ means that there is one observation in essentially every leaf node.  So our decision tree is finding a set of if-else statements that makes a separate prediction for each observation.
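
We do not have to rely on the picture to check this.  The sketch below refits a tree on the same rows and counts how many training rows land in each leaf.  The names `X_fit`, `y_fit`, `tree_model`, and `rows_per_leaf` are ours, and `random_state=0` is an arbitrary choice that only makes the run repeatable.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.tree import DecisionTreeRegressor

data = load_diabetes()
X_fit, y_fit = data.data[:-10], data.target[:-10]   # all but the last ten rows

tree_model = DecisionTreeRegressor(random_state=0)  # no depth limit, as in the lesson
tree_model.fit(X_fit, y_fit)

leaf_ids = tree_model.apply(X_fit)                  # index of the leaf each training row lands in
_, rows_per_leaf = np.unique(leaf_ids, return_counts=True)

n_rows = len(X_fit)
n_leaves = tree_model.get_n_leaves()

print("training rows:", n_rows)
print("leaf nodes:   ", n_leaves)
print("leaves holding exactly one row:", int(np.sum(rows_per_leaf == 1)))
```

If the number of leaves comes out close to the number of training rows, the tree has effectively memorized the training data one observation at a time.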

What's the problem with that?  The problem is that there's randomness in every dataset.  If the tree just finds rules that happen to match the past data but do not *generalize* to future data, then the model is not useful to us.  Once again, we *know* how the past data turned out.  What we want is a machine learning model that can look at the features of a *future* observation and predict the outcome.

So the way that we really evaluate our machine learning algorithm is to hold back some of our data from training, and then see how well our machine learning model performs on data it's never seen before.  

This is called our test train split.  That is, we split our data, using some of it for training our model and the rest for testing how well our model does on data it did not see.  The test set gives us a sense of how well our model will perform on future data.

What percentage of data should be used in training, and what percentage in testing?  Well, holding back 20 percent for testing is an ok rule of thumb to start with.  Let's try it.

### Implementing our Test Train Split

We can use the `shape` attribute to see that we have 442 observations in our dataset.

In [67]:
X.shape

(442, 10)

Let's hold back 20% of the data.

In [68]:
.2*442

88.4

Ok, so now we assign all but the last 88 observations to our training set, and the remainder to our test set.

In [69]:
X_train = X[:-88]
y_train = y[:-88]

X_test = X[-88:]
y_test = y[-88:]
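
The slices above hold back the last rows in the order they appear in the dataset.  sklearn also ships a helper, `train_test_split` in the `sklearn.model_selection` module, that shuffles the rows before splitting.  Here is a sketch of that alternative.  The `_shuffled` variable names are ours so that they do not overwrite the variables above, and `random_state=0` is an arbitrary seed.

```python
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split

features, target = load_diabetes(return_X_y=True)

X_train_shuffled, X_test_shuffled, y_train_shuffled, y_test_shuffled = train_test_split(
    features, target,
    test_size=0.2,     # hold back roughly 20 percent of the rows for testing
    random_state=0,    # arbitrary seed so the shuffle is repeatable
)

print(X_train_shuffled.shape, X_test_shuffled.shape)
```

Shuffling matters when the rows are stored in some meaningful order, since the last rows may then differ systematically from the rest.  For the remainder of this lesson we stick with the plain slices.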

In [71]:
from sklearn.tree import DecisionTreeRegressor
model = DecisionTreeRegressor()
model.fit(X_train, y_train)

DecisionTreeRegressor(criterion='mse', max_depth=None, max_features=None,
                      max_leaf_nodes=None, min_impurity_decrease=0.0,
                      min_impurity_split=None, min_samples_leaf=1,
                      min_samples_split=2, min_weight_fraction_leaf=0.0,
                      presort=False, random_state=None, splitter='best')

In [72]:
model.score(X_test, y_test)

-0.24113290219476746

Well, once again, our decision tree does not do a good job at predicting data it has not yet seen.  It's for the same reason: it's just finding rules to match every observation it's seen, but to do so, it's coming up with logic that does not repeat in new data.  In other words, it is not separating the signal from the noise.
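
One way to see the gap is to score the same fitted tree on both sets.  The sketch below uses the same 88-row holdout as above.  The names `gap_model`, `train_score`, and `test_score` are ours, and `random_state=0` is an arbitrary choice, so the exact numbers can differ from the run shown earlier.

```python
from sklearn.datasets import load_diabetes
from sklearn.tree import DecisionTreeRegressor

features, target = load_diabetes(return_X_y=True)

# Same split as above: everything but the last 88 rows is used for training
train_X, train_y = features[:-88], target[:-88]
test_X, test_y = features[-88:], target[-88:]

gap_model = DecisionTreeRegressor(random_state=0)
gap_model.fit(train_X, train_y)

train_score = gap_model.score(train_X, train_y)   # rows the tree was fit on
test_score = gap_model.score(test_X, test_y)      # rows the tree has never seen

print(f"train score: {train_score:.3f}")
print(f"test score:  {test_score:.3f}")
```

A large drop from the training score to the test score is the telltale sign of overfitting.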

But don't get discouraged.  There is a solution of course.  We'll see it in the next lesson.

### Summary