# Overfitting and Early Stopping

## Introduction

In previous lessons, we have seen that decision trees choose features that split our data into groups with similar target values. For example, when we look at the decision tree for customer leads, our tree perfectly segments our dataset between customers and non-customers.

![](DTreeViz_customers.svg)

In this lesson, we'll see how decision trees can segment our data a little too well. That is, we'll learn one way that decision trees are prone to overfitting. Remember that overfitting is problematic because it means that our model is fitting to the random variation in the training data as opposed to capturing the true underlying pattern.

For this lesson, let's work with the diabetes dataset provided with sklearn.

## Working with the diabetes dataset

Let's start by loading up our data.

In [4]:
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
dataset = load_diabetes()
X = dataset['data']
y = dataset['target']

X_train, X_test, y_train, y_test = train_test_split(
     X, y, test_size=0.33, random_state=42)

The following information about the features and target is provided.

> **Target:** A quantitative measure of disease progression one year after baseline

> **Attribute Information**
      - Age
      - Sex
      - Body mass index
      - Average blood pressure
      - S1
      - S2
      - S3
      - S4
      - S5
      - S6

In [5]:
X_train.shape

(296, 10)

So we have ten features and close to 300 observations.
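As a quick check on the description above, the sketch below prints the feature names that ship with the dataset along with the shapes of both splits. It assumes the same `test_size` and `random_state` used earlier in this lesson.

```python
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split

dataset = load_diabetes()
X, y = dataset['data'], dataset['target']

# Same split as in the lesson: one third of the rows held out for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)

# Short column names that correspond to the attribute list above.
feature_names = list(dataset['feature_names'])
print(feature_names)

# Rows and columns in each split.
print(X_train.shape, X_test.shape)

# The target is a continuous number, so this is a regression problem.
print(y_train.min(), y_train.mean(), y_train.max())
```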

## Training our decision tree

Now let's train our decision tree and see how it performs.

In [6]:
from sklearn.tree import DecisionTreeRegressor
dtc = DecisionTreeRegressor()
dtc.fit(X_train, y_train)

DecisionTreeRegressor(criterion='mse', max_depth=None, max_features=None,
                      max_leaf_nodes=None, min_impurity_decrease=0.0,
                      min_impurity_split=None, min_samples_leaf=1,
                      min_samples_split=2, min_weight_fraction_leaf=0.0,
                      presort=False, random_state=None, splitter='best')

In [7]:
dtc.score(X_train, y_train)

1.0

Look at the score above: our first perfectly fit model. Of course, what really matters is how well our decision tree performs on data it has not yet seen.

In [8]:
dtc.score(X_test, y_test)

-0.14862408533742344

Oops.

Here is just a small portion of the decision tree that was created above.

![](./d-tree.png)

Take a look at the bottom layer of the decision tree above. There we have a lot of leaves, each with a sample size of just one.

* So our decision tree is able to perfectly fit our training data.
* But in most, if not all, cases, its prediction is based on a sample size of one.
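We can confirm the picture numerically instead of reading it off the plot. The following is a minimal sketch that refits an unconstrained tree and counts the samples in each leaf through the fitted `tree_` attribute. The names `full_tree` and `leaf_sizes` and the fixed `random_state=0` are choices made for this illustration, not part of the lesson's code.

```python
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)

# Unconstrained tree; random_state is fixed only to make reruns repeatable.
full_tree = DecisionTreeRegressor(random_state=0)
full_tree.fit(X_train, y_train)

tree = full_tree.tree_
# A node is a leaf when it has no left child (sklearn marks this with -1).
is_leaf = tree.children_left == -1
# Number of training samples that ended up in each leaf.
leaf_sizes = tree.n_node_samples[is_leaf]

print("depth:", full_tree.get_depth())
print("number of leaves:", leaf_sizes.size)
print("leaves holding a single sample:", int((leaf_sizes == 1).sum()))
print("largest leaf:", int(leaf_sizes.max()))
```

Every training row lands in exactly one leaf, so the leaf sizes add up to the number of training rows.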

### Correcting for Overfitting

So now we can better understand how our decision tree overfits to the training data. It overfits because if a prediction contains any error, our decision tree simply tries another split. At a certain point, the tree is no longer fitting to the pattern in the data, but to the data's random variation. In the decision tree above, the negative score on the test set means that we performed worse than a model that always predicts the mean of the target.
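To make the comparison with the mean concrete: for a regressor, `score` returns the coefficient of determination (R²), which compares the model's squared error to that of a constant mean prediction. The sketch below computes R² by hand for a mean-only baseline and for an unconstrained tree. The helper `r_squared` and the variable names are hypothetical, introduced only for this illustration.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.dummy import DummyRegressor
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)

def r_squared(y_true, y_pred):
    """R^2 = 1 - (model's squared error) / (squared error around the mean)."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1 - ss_res / ss_tot

# Baseline that ignores the features and always predicts the training mean.
mean_model = DummyRegressor(strategy="mean").fit(X_train, y_train)
# Unconstrained tree; random_state is fixed only to make reruns repeatable.
full_tree = DecisionTreeRegressor(random_state=0).fit(X_train, y_train)

mean_r2 = r_squared(y_test, mean_model.predict(X_test))
tree_r2 = r_squared(y_test, full_tree.predict(X_test))

# A constant prediction scores at or just below 0, so a tree with a
# negative test score is doing worse than this baseline.
print("mean baseline R^2:", mean_r2)
print("full tree R^2:    ", tree_r2)
```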

One way to correct for this overfitting is to use early stopping. This stops the decision tree at an earlier level, before its leaves have just one observation each.

Let's see this in action.

### Constraining the number of levels

By setting the `max_depth` of the tree, we can stop our decision tree from splitting all the way down to a single observation in each leaf node.

In [9]:
from sklearn.tree import DecisionTreeRegressor
# Allow at most two levels of splits below the root.
dtc = DecisionTreeRegressor(max_depth=2)
dtc.fit(X_train, y_train)

DecisionTreeRegressor(criterion='mse', max_depth=2, max_features=None,
                      max_leaf_nodes=None, min_impurity_decrease=0.0,
                      min_impurity_split=None, min_samples_leaf=1,
                      min_samples_split=2, min_weight_fraction_leaf=0.0,
                      presort=False, random_state=None, splitter='best')

Take a look at what this accomplished.

In [19]:
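# Older dtreeviz API, as used in this lesson; newer releases expose a different entry point.
from dtreeviz.trees import dtreeviz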
viz = dtreeviz(dtc,
               X_train,
               y_train,
               target_name='progression',
               feature_names=dataset['feature_names'])    
viz.view()

<img src="./constrained-levels.png" width="60%">

Notice that we have now constrained the tree to the root plus two more levels. This means that there are only two levels of splits in our data. And we can see that at the leaf nodes, we have a minimum of 22 samples (in the bottom right node) from which we make the predictions.

Let's see how the model performs on the training and test sets.

In [23]:
dtc.score(X_train, y_train)

0.43114967656078584

In [24]:
dtc.score(X_test, y_test)

0.40359839634724004

So we can see that in preventing our model from making a separate prediction for each observation, we reduce overfitting, and thus achieve a higher score on our holdout set. Note that for a regressor, `score` returns R² rather than accuracy.
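The same comparison can be repeated for several depths to see where the training and test scores start to diverge. This is a minimal sketch: the list of depths and the fixed `random_state=0` are arbitrary choices for illustration, and `None` stands for the unconstrained tree.

```python
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)

results = []
for depth in [1, 2, 3, 5, 8, None]:
    # random_state is fixed only to make reruns repeatable.
    model = DecisionTreeRegressor(max_depth=depth, random_state=0)
    model.fit(X_train, y_train)
    results.append((depth,
                    model.score(X_train, y_train),
                    model.score(X_test, y_test)))

# One row per depth: R^2 on the training set and on the held-out set.
for depth, train_r2, test_r2 in results:
    print(f"max_depth={depth!s:>4}  train={train_r2:.3f}  test={test_r2:.3f}")
```

A widening gap between the two columns as the depth grows is the signature of overfitting.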

What we just saw is an example of a hyperparameter.

> A hyperparameter is a parameter whose value is set *before* our model begins training. By contrast, our other parameters are derived through training [Wikipedia](https://en.wikipedia.org/wiki/Hyperparameter_(machine_learning)).

So here we set the `max_depth` parameter before we trained our model. This makes it a hyperparameter.
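One way to see the distinction in code: hyperparameters are available from `get_params()` before `fit` is ever called, while the learned structure (which feature to split on, at what threshold) only exists on the fitted `tree_` attribute. The sketch below assumes the same data split as above; `model` and `root_feature` are illustrative names.

```python
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

dataset = load_diabetes()
X_train, X_test, y_train, y_test = train_test_split(
    dataset['data'], dataset['target'], test_size=0.33, random_state=42)

model = DecisionTreeRegressor(max_depth=2)

# Hyperparameter: chosen by us, readable before any training happens.
print("max_depth:", model.get_params()["max_depth"])
fitted_before = hasattr(model, "tree_")
print("has learned tree before fit:", fitted_before)

model.fit(X_train, y_train)

# Learned parameters: the split feature and threshold chosen from the data.
root_feature = dataset['feature_names'][model.tree_.feature[0]]
print("root split feature:", root_feature)
print("root split threshold:", model.tree_.threshold[0])
print("depth after fit:", model.get_depth())
```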

### Summary

In this lesson, we saw overfitting in decision trees. Decision trees are highly flexible, which allows them to closely match almost any training data. This is useful when decision trees are responding to underlying patterns in the data, but makes them prone to overfitting, as they are also flexible enough to respond to pure noise.

The number of datapoints in our leaf nodes influences the flexibility of our decision trees. Without a constraint, our tree keeps splitting until its leaves hold just one datapoint, so each prediction is based on a sample size of one. To correct for this, we can set the `max_depth` hyperparameter, which limits the number of levels and thus increases the number of observations in the leaf nodes.
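This lesson only used `max_depth`, but scikit-learn also lets us put a floor on the leaf size directly through the `min_samples_leaf` hyperparameter. The sketch below is an illustration of that alternative, not something the lesson tuned; the value of 20 is an arbitrary assumption.

```python
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)

# Require at least 20 training samples in every leaf (arbitrary choice).
model = DecisionTreeRegressor(min_samples_leaf=20, random_state=0)
model.fit(X_train, y_train)

tree = model.tree_
# Leaves are the nodes without a left child (marked with -1).
leaf_sizes = tree.n_node_samples[tree.children_left == -1]

print("smallest leaf:", int(leaf_sizes.min()))
print("train R^2:", model.score(X_train, y_train))
print("test R^2: ", model.score(X_test, y_test))
```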