# Overfitting a Decision Tree

One of the issues with a decision tree is that it is easy to overfit. Let's look at an example to see how the depth of a decision tree affects overfitting. We'll create some random data with added noise and see how a decision tree regressor predicts it at different depth levels.

In [None]:
# import packages
import numpy as np
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeRegressor

In [None]:
# Seed a random number generator so the noisy sine-curve data set is reproducible
rng = np.random.RandomState(42)

# Create 80 random data points for X
# range from 0 to 5
X = np.sort(5*rng.rand(80,1), axis=0)

In [None]:
# Look at X
X

In [None]:
# Create y (the target) which will be the sine curve
y = np.sin(X).ravel()

# Now add some noise (randomness) to the curve at every 5th element
y[::5] += 3 * (0.5 - rng.rand(16))

# plot the X and y to see our noisy data
plt.scatter(X, y, s=20, edgecolor="black",
           c="darkorange", label="data");

In [None]:
# Fit a decision tree with max_depth=2
reg1 = DecisionTreeRegressor(max_depth=2)
reg1.fit(X, y)

In [None]:
# Create an X_test data set: evenly spaced points from 0 to 5 in steps of 0.01
X_test = np.arange(0.0, 5.0, 0.01)[:, np.newaxis]

# Predict using decision tree regressor
y_1 = reg1.predict(X_test)

In [None]:
# see what y_1 looks like
y_1

In [None]:
# plot the noisy data together with the max_depth=2 predictions
plt.scatter(X, y, s=20, edgecolor="black",
           c="darkorange", label="data")
plt.plot(X_test, y_1, color="cornflowerblue",
        label="max_depth=2", linewidth=2);
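
The step shape of the blue line follows from the tree's structure: a binary tree of depth 2 has at most 2² = 4 leaves, and a regression tree predicts one constant value per leaf. The following is a minimal, self-contained sketch that checks this on the same data. The `_demo` variable names, `shallow`, `grid`, and the `random_state=0` argument are additions for illustration and are not part of the cells above.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Rebuild the noisy sine data used above
rng = np.random.RandomState(42)
X_demo = np.sort(5 * rng.rand(80, 1), axis=0)
y_demo = np.sin(X_demo).ravel()
y_demo[::5] += 3 * (0.5 - rng.rand(16))

# Fit the shallow tree (random_state is an added assumption for repeatability)
shallow = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X_demo, y_demo)

# Predict on the same grid and count the distinct predicted values
grid = np.arange(0.0, 5.0, 0.01)[:, np.newaxis]
distinct_values = np.unique(shallow.predict(grid))

print("depth:", shallow.get_depth())
print("leaves:", shallow.get_n_leaves())
print("distinct predictions:", len(distinct_values))
```

However many test points you pass in, the number of distinct predictions cannot exceed the number of leaves.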

## Try Depth of 6

This should give us a more flexible predictor: a tree of depth 6 can have up to 2^6 = 64 leaves, compared with at most 4 at depth 2.

In [None]:
# Let's try a new decision tree with max_depth=6
reg2 = DecisionTreeRegressor(max_depth=6)
reg2.fit(X, y)

# Make predictions with this new tree
y_2 = reg2.predict(X_test)

In [None]:
# plot the data and the max_depth=2 decision tree and the max_depth=6 tree
plt.scatter(X, y, s=20, edgecolor="black",
           c="darkorange", label="data")
plt.plot(X_test, y_1, color="cornflowerblue",
        label="max_depth=2", linewidth=2)
plt.plot(X_test, y_2, color="yellowgreen",
        label="max_depth=6", linewidth=2)
plt.xlabel("data")
plt.ylabel("target")
plt.title("Decision Tree Regression")
plt.legend();

## Try the Full Tree

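To build the full tree, simply omit the `max_depth` argument. With the default `max_depth=None`, the tree keeps splitting until every leaf is pure or holds fewer than `min_samples_split` samples.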

In [None]:
# What if we let it build the full tree?
reg3 = DecisionTreeRegressor()
reg3.fit(X, y)

# predict using the full tree
y_3 = reg3.predict(X_test)

In [None]:
# plot the data with the max_depth=2, max_depth=6, and full-tree predictions
plt.scatter(X, y, s=20, edgecolor="black",
           c="darkorange", label="data")
plt.plot(X_test, y_1, color="cornflowerblue",
        label="max_depth=2", linewidth=2)
plt.plot(X_test, y_2, color="yellowgreen",
        label="max_depth=6", linewidth=2)
plt.plot(X_test, y_3, color="red",
        label="full tree", linewidth=2)
plt.xlabel("data")
plt.ylabel("target")
plt.title("Decision Tree Regression")
plt.legend();
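
The plot suggests that the deeper trees chase the noisy points. One way to put numbers on that is to compare each tree's error on the training points with its error on data it has not seen. The sketch below is a rough illustration, not a proper validation setup: it assumes the noise-free sine curve on a regular grid is a fair stand-in for unseen data, and the names `X_train`, `X_grid`, `train_mse`, and `grid_mse` as well as `random_state=0` are introduced here only for the example.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.tree import DecisionTreeRegressor

# Rebuild the noisy sine data used above
rng = np.random.RandomState(42)
X_train = np.sort(5 * rng.rand(80, 1), axis=0)
y_train = np.sin(X_train).ravel()
y_train[::5] += 3 * (0.5 - rng.rand(16))

# Evaluation grid with noise-free targets (an assumption made for this sketch)
X_grid = np.arange(0.0, 5.0, 0.01)[:, np.newaxis]
y_grid = np.sin(X_grid).ravel()

train_mse = {}
grid_mse = {}
for depth in (2, 6, None):  # None lets the tree grow fully
    tree = DecisionTreeRegressor(max_depth=depth, random_state=0)
    tree.fit(X_train, y_train)
    train_mse[depth] = mean_squared_error(y_train, tree.predict(X_train))
    grid_mse[depth] = mean_squared_error(y_grid, tree.predict(X_grid))
    print(f"max_depth={depth}: train MSE={train_mse[depth]:.4f}, "
          f"clean-grid MSE={grid_mse[depth]:.4f}")
```

The training error can only shrink as the tree deepens, and it is essentially zero for the full tree. A clean-grid error that stops improving, or grows, while the training error keeps falling is the signature of overfitting.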

In [None]:
# what does the resulting tree look like?
from sklearn.tree import export_text
print(export_text(reg3))

In [None]:
# draw the full tree as a diagram
from sklearn.tree import plot_tree
fig = plt.figure(figsize=(15,15))
plot_tree(reg3, filled=True);

## Pruning

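Instead of explicitly setting `max_depth`, you can prune the tree with the `ccp_alpha` argument, the complexity parameter for minimal cost-complexity pruning. The default of 0 means no pruning; larger values remove more of the tree.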

In [None]:
# How about pruning?
# What does the tree look like with ccp_alpha=0.1?
reg4 = DecisionTreeRegressor(ccp_alpha=0.1)
reg4.fit(X, y)

# Predict for this tree
y_4 = reg4.predict(X_test)

In [None]:
# plot the data with the full-tree and pruned-tree predictions
plt.scatter(X, y, s=20, edgecolor="black",
           c="darkorange", label="data")

plt.plot(X_test, y_3, color="red",
        label="full tree", linewidth=2)
plt.plot(X_test, y_4, color="purple",
        label="pruned tree", linewidth=2)
plt.xlabel("data")
plt.ylabel("target")
plt.title("Decision Tree Regression")
plt.legend();

In [None]:
# draw the pruned tree (plot_tree was imported above)
fig = plt.figure(figsize=(15,15))
plot_tree(reg4, filled=True);
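
A value such as 0.1 is a guess. scikit-learn can list the alpha values at which the pruned tree actually changes, through the estimator's `cost_complexity_pruning_path` method. The following is a minimal sketch on the same data; the names `X_demo`, `y_demo`, and `full_tree`, and the `random_state=0` argument, are assumptions added for this example.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Rebuild the noisy sine data used above
rng = np.random.RandomState(42)
X_demo = np.sort(5 * rng.rand(80, 1), axis=0)
y_demo = np.sin(X_demo).ravel()
y_demo[::5] += 3 * (0.5 - rng.rand(16))

# Compute the pruning path: the effective alphas and the total leaf
# impurity of the subtree that each alpha produces
full_tree = DecisionTreeRegressor(random_state=0)
path = full_tree.cost_complexity_pruning_path(X_demo, y_demo)
ccp_alphas, impurities = path.ccp_alphas, path.impurities

print("number of candidate alphas:", len(ccp_alphas))
print("largest five alphas:", np.round(ccp_alphas[-5:], 4))
```

Each alpha corresponds to one pruning step, and the largest one prunes the tree down to a single node, so useful candidates for `ccp_alpha` lie below it.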

### Try a Different Value for `ccp_alpha`
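
One way to explore this is to fit a tree for several values and watch how the tree size responds. The sketch below is only an illustration: the candidate values in `alphas` are hypothetical picks, not recommendations, and the `_demo` names and `random_state=0` are additions for this example.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Rebuild the noisy sine data used above
rng = np.random.RandomState(42)
X_demo = np.sort(5 * rng.rand(80, 1), axis=0)
y_demo = np.sin(X_demo).ravel()
y_demo[::5] += 3 * (0.5 - rng.rand(16))

# Hypothetical candidate values, chosen only for illustration
alphas = [0.0, 0.001, 0.01, 0.1, 0.5]

n_leaves = []
for alpha in alphas:
    tree = DecisionTreeRegressor(ccp_alpha=alpha, random_state=0)
    tree.fit(X_demo, y_demo)
    n_leaves.append(tree.get_n_leaves())
    print(f"ccp_alpha={alpha}: {tree.get_n_leaves()} leaves, "
          f"depth {tree.get_depth()}")
```

A value of 0.0 leaves the full tree untouched, and the leaf count never increases as `ccp_alpha` grows. To compare a particular value visually, reuse the plotting cells above with the new tree's predictions.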

