# Decision Trees and CART - Regression

As discussed in class, Classification and Regression Trees (CART) work by partitioning the space of possible solutions into simpler subsets through a hierarchy of tests on the input features. To predict the (unknown) value of a new observation, we examine the split condition at each point in the tree, test the condition against the features of the new observation, and follow the "true" or "false" path accordingly. We repeat this process until we reach a leaf of the tree, at which point the prediction is made by returning the value residing at the leaf node. In the context of regression, the value of a leaf node is the mean of the response values of the training data that reside in the region of space represented by the node.
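
The traversal described above can be sketched in a few lines of plain Python. Everything in this block is made up for illustration: the toy data, the nested-dictionary layout, the threshold of 0.0, and the helpers `leaf_mean` and `predict_one` are assumptions, not the structure scikit-learn uses internally.

```python
# Toy training data (made up for illustration): one feature, one response.
x_toy = [-0.8, -0.3, 0.2, 0.7]
y_toy = [0.1, 0.3, 0.9, 1.5]

def leaf_mean(lo, hi):
    """Mean response of the toy training points with lo < x <= hi."""
    values = [y for x, y in zip(x_toy, y_toy) if lo < x <= hi]
    return sum(values) / len(values)

# A hypothetical one-split tree: the root tests "x <= 0.0",
# and each leaf stores the mean response of the training points it contains.
toy_tree = {
    "threshold": 0.0,
    "true": {"value": leaf_mean(float("-inf"), 0.0)},
    "false": {"value": leaf_mean(0.0, float("inf"))},
}

def predict_one(node, x):
    """Follow the true/false path from the root until a leaf is reached."""
    while "value" not in node:
        node = node["true"] if x <= node["threshold"] else node["false"]
    return node["value"]

left_pred = predict_one(toy_tree, -0.5)   # lands in the leaf holding 0.1 and 0.3
right_pred = predict_one(toy_tree, 0.4)   # lands in the leaf holding 0.9 and 1.5
print(left_pred, right_pred)
```

A deeper tree would simply nest further dictionaries under the "true" and "false" keys; the loop in `predict_one` would not need to change.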

To work with CART in Python, we can make use of the existing [numpy](https://numpy.org/) and [scikit-learn](https://scikit-learn.org/) libraries. Let's start by importing them (along with [matplotlib](https://matplotlib.org/) for plotting the results):

In [None]:
import numpy as np

from sklearn.tree import DecisionTreeRegressor, plot_tree
from sklearn.metrics import mean_squared_error

from matplotlib import pyplot as plt

## Our "Training" Data (this is identical to Lecture 5)

We'll also need some data to act as the historical observations from our problem. For this example, we'll just make some up (but usually, you'd be given this data). In this case, we (the people doing the work) actually know the underlying function $f(X)$, but from the modelling perspective this function is not known:

In [None]:
def f(X):
    return (X[:, 0] + 0.5)**2 + 0.25 * np.sin(4 * np.pi * X[:, 0])

Now, we will use this function to generate some training data:

In [None]:
rng = np.random.default_rng(1234) ## notice the fixed seed for reproducibility

n_points = 50
X_train = rng.uniform(-1, 1, size=n_points).reshape((-1, 1))
y_train = f(X_train) + rng.normal(0, 0.1, size=len(X_train))

## we'll also generate some "test" data and use it to examine the shape of our learned function shortly
X = np.linspace(-1, 1, 500).reshape((-1, 1))
y = f(X)

Let's take a look at the data (and underlying generating function) before moving on to modelling:

In [None]:
plt.figure(figsize=(10, 6))
plt.scatter(X_train, y_train, alpha=0.2, label='Sampled Training Data')
plt.plot(X, y, color='black', label='True Underlying Function - f(X)')
plt.title('Our Sampled Training Data')
plt.xlabel('X')
plt.ylabel('y')
plt.legend();

## Applying CART

Now that we have a training set of data, we can move on to modelling. We do this by instantiating a <code>DecisionTreeRegressor</code> model, calling its <code>fit</code> function, and then making predictions from the resulting model with its <code>predict</code> function. When building a CART model, we need to specify the minimum split size of a node (called <code>min_samples_split</code> in scikit-learn). Here we will examine this process for a range of <code>min_samples_split</code> values:

In [None]:
fig, ax = plt.subplots(2, 3, figsize=(16, 10))
for (i, k) in enumerate([ 2, 3, 5, 10, 25, 2*n_points ]):
    mdl = DecisionTreeRegressor(random_state=0, min_samples_split=k)
    mdl.fit(X_train, y_train)
    leaf_loss = mean_squared_error(f(X), mdl.predict(X))

    r = i // 3
    c = i % 3
    ax[r, c].scatter(X_train, y_train, alpha=0.2, label='Sampled Training Data')
    ax[r, c].plot(X, f(X), color='black', label='True Underlying Function - f(X)')
    ax[r, c].plot(X, mdl.predict(X), color='#ce2227', label="Estimated Function - CART, minsplit={}".format(k))
    ax[r, c].set_title("CART Performance, MSE={}".format(np.round(leaf_loss, 2)))
    ax[r, c].set_xlabel('X')
    ax[r, c].set_ylabel('y')
    ax[r, c].legend()


In each plot, the red line represents the model that was extracted from the training data by the algorithm for a given <code>min_samples_split</code> value. Notice that when <code>min_samples_split</code>=2, the resulting model is quite sensitive to noise in the training data (it captures a lot of this noise in the model and so it regularly deviates from the true underlying function). As <code>min_samples_split</code> is increased, the resulting models become less sensitive to the noise in the training data. However, there is a balancing act here: as <code>min_samples_split</code> becomes very large, CART starts to lose some of the detail in the underlying function. At its extreme (<code>min_samples_split</code> is larger than the size of the training data), the algorithm produces a model that is equivalent to the mean of the training data for all cases (hence the straight line).
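
The extreme case is easy to check directly. The sketch below regenerates the training data with the same seed and settings as the cells above (repeated only so that the block runs on its own) and compares the predictions of an unsplittable tree with the mean of the training responses. The names `stump` and `is_constant_mean` are introduced here for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def f(X):
    return (X[:, 0] + 0.5)**2 + 0.25 * np.sin(4 * np.pi * X[:, 0])

# Same sampling procedure as the training-data cell above.
rng = np.random.default_rng(1234)
n_points = 50
X_train = rng.uniform(-1, 1, size=n_points).reshape((-1, 1))
y_train = f(X_train) + rng.normal(0, 0.1, size=len(X_train))
X = np.linspace(-1, 1, 500).reshape((-1, 1))

# min_samples_split exceeds the training set size, so the root is never split.
stump = DecisionTreeRegressor(random_state=0, min_samples_split=2 * n_points)
stump.fit(X_train, y_train)

# Every prediction should equal the mean of the training responses.
is_constant_mean = np.allclose(stump.predict(X), y_train.mean())
print(is_constant_mean, stump.get_n_leaves())
```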

## Visualising the CART model
In the above examples, CART was run on the same data six times with different values for <code>min_samples_split</code>, which resulted in six different trees being produced. One of the most valuable aspects of CART is that, for reasonably small trees, the model is highly interpretable (i.e., we can read the tree as a series of rules, and we can visualise the tree as a hierarchical series of splitting rules). To visualise a tree, we can use the <code>plot_tree</code> function from scikit-learn. For example, the six models created in the previous cell can be visualised as follows:

In [None]:
fig, ax = plt.subplots(2, 3, figsize=(16, 10))
for (i, k) in enumerate([ 2, 3, 5, 10, 25, 2*n_points ]):
    mdl = DecisionTreeRegressor(random_state=0, min_samples_split=k)
    mdl.fit(X_train, y_train)
    leaf_loss = mean_squared_error(f(X), mdl.predict(X))

    r = i // 3
    c = i % 3
    plot_tree(mdl, ax=ax[r, c], feature_names=[ 'X' ], rounded=True, label='none', proportion=False, impurity=False, filled=True)
    ax[r, c].set_title("CART Performance, minsplit={}, MSE={}".format(k, np.round(leaf_loss, 2)))

We interpret the trees as follows: the leaf nodes show the number of training instances that matched the splitting criteria up to this point (top number), along with the mean of the response of these training instances (the bottom number). The interior nodes show three things: the splitting rule at this point (top row), the number of matching training instances at this point in the tree (second row), and the mean of the response of these training instances (bottom row). We read the tree as a series of if-then-else rules. For example, the CART tree resulting from <code>min_samples_split</code>=25 can be read as follows:
<pre>
if X <= 0.565 then
    if X <= 0.015 then
        if X <= -0.79 then
            prediction = 0.417
        else
            prediction = 0.028
    else
        prediction = 0.654
else
    prediction = 1.745
</pre>
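
Rules of this kind do not have to be transcribed by hand. As a minimal sketch, scikit-learn's <code>export_text</code> function prints a fitted tree as indented text. The block regenerates the training data with the same seed and settings as the cells above so that it runs on its own; the variable name `rules` and the choice of three decimals are assumptions made for this example.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

def f(X):
    return (X[:, 0] + 0.5)**2 + 0.25 * np.sin(4 * np.pi * X[:, 0])

# Same sampling procedure as the training-data cell above.
rng = np.random.default_rng(1234)
n_points = 50
X_train = rng.uniform(-1, 1, size=n_points).reshape((-1, 1))
y_train = f(X_train) + rng.normal(0, 0.1, size=len(X_train))

mdl = DecisionTreeRegressor(random_state=0, min_samples_split=25)
mdl.fit(X_train, y_train)

# Each indented line is either a split test on X or the value stored at a leaf.
rules = export_text(mdl, feature_names=['X'], decimals=3)
print(rules)
```

Each level of indentation in the printed text corresponds to one level of nesting in the if-then-else reading of the tree.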

Note that, as <code>min_samples_split</code> _increases_, the tree complexity _decreases_, right up to the point where the tree collapses into a single node.
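
One way to put numbers on this is to count the leaves and measure the depth of each fitted tree. The sketch below uses the <code>get_n_leaves</code> and <code>get_depth</code> methods of the fitted model; it regenerates the training data with the same seed and settings as the cells above so that it runs on its own, and the names `split_values`, `n_leaves` and `depths` are introduced here for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def f(X):
    return (X[:, 0] + 0.5)**2 + 0.25 * np.sin(4 * np.pi * X[:, 0])

# Same sampling procedure as the training-data cell above.
rng = np.random.default_rng(1234)
n_points = 50
X_train = rng.uniform(-1, 1, size=n_points).reshape((-1, 1))
y_train = f(X_train) + rng.normal(0, 0.1, size=len(X_train))

split_values = [2, 3, 5, 10, 25, 2 * n_points]
n_leaves = []
depths = []
for k in split_values:
    mdl = DecisionTreeRegressor(random_state=0, min_samples_split=k)
    mdl.fit(X_train, y_train)
    n_leaves.append(mdl.get_n_leaves())  # number of leaf nodes in the fitted tree
    depths.append(mdl.get_depth())       # longest root-to-leaf path
    print("minsplit={:>3}: leaves={:>2}, depth={}".format(k, n_leaves[-1], depths[-1]))
```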

## Examining the effect of <code>min_samples_split</code>
The main hyperparameter for CART is the minimum node split size <code>min_samples_split</code>. In the previous step, we looked at an arbitrary set of possible values for <code>min_samples_split</code> - let's now be a little more rigorous and examine the performance of CART over a more thorough sweep of values of <code>min_samples_split</code>:

In [None]:
all_split = np.arange(2, n_points + 1)
split_loss = []
for k in all_split:
    mdl = DecisionTreeRegressor(random_state=0, min_samples_split=k)
    mdl.fit(X_train, y_train)
    split_loss.append(mean_squared_error(f(X), mdl.predict(X)))

Finally, we can plot the loss against the minimum split size to see how <code>min_samples_split</code> influences the behaviour of the algorithm and the resulting model:

In [None]:
best_split = np.argmin(split_loss)
fig = plt.figure(figsize=(16,10))
plt.rcParams.update({'font.size': 24})
plt.plot(all_split, split_loss)
plt.scatter(all_split[best_split], split_loss[best_split], color='#ce2227', label="Lowest MSE, minsplit={}".format(all_split[best_split]))
plt.xlabel('minsplit')
plt.ylabel('MSE')
plt.title('CART Performance for minsplit')
plt.legend();

In this result, we can see that (for this sample of training data!) the lowest error can be achieved with <code>min_samples_split</code> set to 6. We can also see that performance degrades slightly when <code>min_samples_split</code> is smaller than 6 (due to overfitting the data), and error gets progressively worse with larger values of <code>min_samples_split</code> (due to increasing underfitting of the data). We will discuss this (and strategies for algorithmic tuning and model selection) in later lectures.
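
To see why the caveat "for this sample of training data" matters, the sweep can be repeated on training sets drawn with different seeds. This is only a sketch: the helper `best_min_samples_split` and the list of seeds are hypothetical choices made for this example, and the selected values will vary from seed to seed.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

def f(X):
    return (X[:, 0] + 0.5)**2 + 0.25 * np.sin(4 * np.pi * X[:, 0])

def best_min_samples_split(seed, n_points=50):
    """Return the min_samples_split with the lowest MSE against f for one sampled training set."""
    rng = np.random.default_rng(seed)
    X_train = rng.uniform(-1, 1, size=n_points).reshape((-1, 1))
    y_train = f(X_train) + rng.normal(0, 0.1, size=len(X_train))
    X = np.linspace(-1, 1, 500).reshape((-1, 1))

    all_split = np.arange(2, n_points + 1)
    losses = []
    for k in all_split:
        mdl = DecisionTreeRegressor(random_state=0, min_samples_split=int(k))
        mdl.fit(X_train, y_train)
        losses.append(mean_squared_error(f(X), mdl.predict(X)))
    return int(all_split[np.argmin(losses)])

seeds = [0, 1, 2, 3, 4]  # hypothetical seeds, chosen only for illustration
best_per_seed = {seed: best_min_samples_split(seed) for seed in seeds}
print(best_per_seed)
```

If the printed values differ across seeds, that is a sign the best setting is a property of the particular sample rather than of the problem itself.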