# Decision Trees

Like SVMs, decision trees are a very popular form of ML. They are also a constituent of random forests, which are very powerful ML engines. They can be used for both classification and regression. They work by dividing the data along hard lines (or cuts) that best separate the data. They keep on doing this until either they can see no more improvement or they reach the maximum depth of cuts that you have allowed them to make. So let's write one and have a look. We will look at the iris data again.

In [None]:
import numpy as np
import pylab as pl 
from sklearn.datasets import load_iris
pl.rcParams['figure.figsize'] = [10, 5] # setting a nice big figure size
iris=load_iris()
display(iris.feature_names)

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(iris['data'],iris['target'], test_size=0.2,random_state=20) 

In [None]:
from sklearn import tree
clf = tree.DecisionTreeClassifier(max_depth=2, random_state =50) 
# max_depth is important: it is the maximum depth of the tree, i.e. the most cuts on any path from the root to a leaf.
# random_state just means that you will get the same numbers as I do.
clf.fit(X_train,y_train)


In [None]:
pred=clf.predict(X_test)

from sklearn.metrics import accuracy_score
print(accuracy_score(y_test, pred))


In [None]:
tree.plot_tree(clf,filled=True)
pl.show()

So you can see that the first node (called the root node) looks at feature X[3] (the petal width) and makes a decision based on whether it is greater than or less than 0.8. Nodes with arrows going to them and away from them are called internal nodes (or just nodes). A node with only arrows going to it and none away is called a leaf. The algorithm then continues doing this until either it sees that there is no point or it has reached the maximum depth that you have allowed it. You can see in the leaf on the left that there is no point going any further, as it already contains only one sort of iris, whereas the right-hand side keeps splitting until it reaches the maximum depth of two.
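If you prefer text to pictures, the same tree can be printed as a set of nested rules. The sketch below assumes the same split and settings as the cells above; the names `iris_demo`, `demo_clf` and `rules` are just illustrative, chosen so they do not overwrite anything in the notebook.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

# Rebuild the small iris tree (same settings as above, illustrative names)
iris_demo = load_iris()
X_tr, X_te, y_tr, y_te = train_test_split(
    iris_demo.data, iris_demo.target, test_size=0.2, random_state=20
)
demo_clf = DecisionTreeClassifier(max_depth=2, random_state=50).fit(X_tr, y_tr)

# export_text writes the root, internal nodes and leaves as indented rules
rules = export_text(demo_clf, feature_names=list(iris_demo.feature_names))
print(rules)
print("depth:", demo_clf.get_depth(), "leaves:", demo_clf.get_n_leaves())
```

Each indented line is a node; lines ending in `class: ...` are the leaves.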


The location of the cut is chosen to maximise the reduction in impurity using the *Classification and Regression Tree* (CART) algorithm. By default this uses the Gini impurity, defined as:

$G=1-\sum_i p_i^2$

where $p_i$ is the fraction of samples in the node that belong to class $i$ -- the lower the Gini impurity, the purer the sample.
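To make the idea of "choosing the cut" concrete, here is a minimal sketch of the search for a single feature: try every midpoint between neighbouring values and keep the one that gives the lowest weighted Gini impurity of the two children. The helper names `gini` and `best_cut` are made up for this illustration, and the real sklearn implementation is more general (it looks at every feature and is heavily optimised).

```python
import numpy as np
from sklearn.datasets import load_iris

def gini(labels):
    """Gini impurity G = 1 - sum_i p_i^2 of an array of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_cut(x, y):
    """Scan candidate thresholds on one feature; return (threshold, weighted impurity)."""
    values = np.unique(x)                        # sorted distinct values
    candidates = (values[:-1] + values[1:]) / 2  # midpoints between neighbours
    best_threshold, best_impurity = None, np.inf
    for t in candidates:
        left, right = y[x <= t], y[x > t]
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
        if weighted < best_impurity:
            best_threshold, best_impurity = t, weighted
    return best_threshold, best_impurity

iris_demo = load_iris()
# Feature 3 is the petal width; here the whole data set is used, not a training split
threshold, impurity = best_cut(iris_demo.data[:, 3], iris_demo.target)
print(threshold, impurity)
```

On the full iris data this scan lands on the same petal-width cut of 0.8 that the root node of the tree uses.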

So you can see that the leaf on the left has a Gini impurity of 0 as it is already pure, whereas the leaf in the middle has an impurity of

$ G= 1- \left(\left(\dfrac{37}{38}\right)^2 +\left(\dfrac{1}{38}\right)^2\right) = 0.051$
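The arithmetic is easy to check by hand or in a couple of lines of Python. The helper `gini_from_counts` is a made-up name for this sketch; the counts 37 and 1 are the ones quoted for the middle leaf.

```python
def gini_from_counts(counts):
    """Gini impurity of a node given the number of samples in each class."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

pure_leaf = gini_from_counts([5])        # any leaf holding a single class
middle_leaf = gini_from_counts([37, 1])  # counts quoted for the middle leaf

print(pure_leaf)               # 0.0
print(round(middle_leaf, 3))   # 0.051
```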

Gini impurity is the default, but you can also use entropy, defined as:

$H=-\sum_i p_i\log_2 (p_i)$

Being honest, this seems to make little difference in reality. It is said that Gini tends to produce branches that are pure in one class, whereas entropy tends to produce a more balanced tree. I have never made any systematic study myself.
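If you want to try this yourself, the choice is made with the `criterion` argument. The sketch below first evaluates the entropy for a node (the helper `entropy_from_counts` is a made-up name) and then trains one small tree per criterion on the iris split used above. It is only a quick comparison under those assumptions, not a systematic study.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

def entropy_from_counts(counts):
    """Entropy H = -sum_i p_i log2(p_i) of a node, skipping empty classes."""
    p = np.asarray(counts, dtype=float)
    p = p[p > 0] / p.sum()
    return float(-np.sum(p * np.log2(p)))

print(entropy_from_counts([37, 1]))   # entropy of the middle leaf discussed above

iris_demo = load_iris()
X_tr, X_te, y_tr, y_te = train_test_split(
    iris_demo.data, iris_demo.target, test_size=0.2, random_state=20
)

# Train the same small tree with each impurity measure and compare test accuracy
scores = {}
for criterion in ("gini", "entropy"):
    model = DecisionTreeClassifier(criterion=criterion, max_depth=2, random_state=50)
    model.fit(X_tr, y_tr)
    scores[criterion] = model.score(X_te, y_te)
print(scores)
```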


### Below is just some code that I found (I think on the sklearn site) that will plot the results for the iris data. It is only included in case you want to see how this works for a (different) tree with multiple dimensions and want to reuse the code. Most of what we do for the next bit will be in two dimensions, just because it is easier to visualise.

In [None]:
import numpy as np
import matplotlib.pyplot as plt

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, plot_tree

# Parameters
n_classes = 3
plot_colors = "ryb"
plot_step = 0.02

# Load data
iris = load_iris()

for pairidx, pair in enumerate([[0, 1], [0, 2], [0, 3], [1, 2], [1, 3], [2, 3]]):
    # We only take the two corresponding features
    X = iris.data[:, pair]
    y = iris.target

    # Train
    clf = DecisionTreeClassifier().fit(X, y)

    # Plot the decision boundary
    plt.subplot(2, 3, pairidx + 1)

    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(
        np.arange(x_min, x_max, plot_step), np.arange(y_min, y_max, plot_step)
    )
    plt.tight_layout(h_pad=0.5, w_pad=0.5, pad=2.5)

    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    cs = plt.contourf(xx, yy, Z, cmap=plt.cm.RdYlBu)

    plt.xlabel(iris.feature_names[pair[0]])
    plt.ylabel(iris.feature_names[pair[1]])

    # Plot the training points
    for i, color in zip(range(n_classes), plot_colors):
        idx = np.where(y == i)
        plt.scatter(
            X[idx, 0],
            X[idx, 1],
            c=color,
            label=iris.target_names[i],
            edgecolor="black",
            s=15,
        )

plt.suptitle("Decision surface of a decision tree using paired features")
plt.legend(loc="lower right", borderpad=0, handletextpad=0)
plt.axis("tight")

plt.figure()
clf = DecisionTreeClassifier().fit(iris.data, iris.target)
#plot_tree(clf, filled=True)
plt.show()

## So now let's look a little more at DTs

OK, so let's use the sklearn moons again.

In [None]:
from sklearn.datasets import make_moons
X, y = make_moons(n_samples=300, noise=0.25, random_state=42)

def plot_dataset(X, y, axes):
    plt.plot(X[:, 0][y==0], X[:, 1][y==0], "bs")
    plt.plot(X[:, 0][y==1], X[:, 1][y==1], "g^")
    plt.axis(axes)
    plt.grid(True, which='both')
    plt.xlabel(r"$x_1$", fontsize=20)
    plt.ylabel(r"$x_2$", fontsize=20, rotation=0)

plot_dataset(X, y, [-1.8, 2.7, -1.5, 1.8])
plt.show()

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.2,random_state=20) 
print(len(y_train))
clf = tree.DecisionTreeClassifier(max_depth=200)
pl.rcParams['figure.figsize'] = [15, 15] #nice big plots
clf.fit(X_train,y_train)
tree.plot_tree(clf,filled=True,fontsize=12)
pl.savefig('tree.png')
pl.show()


In [None]:
pred=clf.predict(X_test)

from sklearn.metrics import accuracy_score
print(accuracy_score(y_test, pred))


In [None]:
from matplotlib.colors import ListedColormap

def plot_decision_boundary(clf, X, y, axes=[0, 7.5, 0, 3], plot_training=True):
    x1s = np.linspace(axes[0], axes[1], 100)
    x2s = np.linspace(axes[2], axes[3], 100)
    x1, x2 = np.meshgrid(x1s, x2s)
    X_new = np.c_[x1.ravel(), x2.ravel()]
    y_pred = clf.predict(X_new).reshape(x1.shape)
    custom_cmap = ListedColormap(['#fafab0','#9898ff','#a0faa0'])
    plt.contourf(x1, x2, y_pred, alpha=0.3, cmap=custom_cmap)
    
    custom_cmap2 = ListedColormap(['#7d7d58','#4c4c7f','#507d50'])
    pl.contour(x1, x2, y_pred, cmap=custom_cmap2, alpha=0.8)
    if plot_training:
        pl.plot(X[:, 0][y==0], X[:, 1][y==0], "yo")
        pl.plot(X[:, 0][y==1], X[:, 1][y==1], "bs")
        pl.plot(X[:, 0][y==2], X[:, 1][y==2], "g^")
        pl.axis(axes)
    pl.xlabel(r"$x_1$", fontsize=18)
    pl.ylabel(r"$x_2$", fontsize=18, rotation=0)


        


In [None]:
pl.rcParams['figure.figsize'] = [10, 5] #more reasonable plots
plot_decision_boundary(clf,X_train,y_train, axes=[-1.8, 2.7, -1.5, 1.8])

In [None]:
# now look at the testing data
plot_decision_boundary(clf,X_test,y_test, axes=[-1.8, 2.7, -1.5, 1.8])

## Overfitting

You can clearly see that decision trees tend to overfit. There are a number of things that you can do to regularise this. You have already seen the max_depth parameter, but you can also use min_samples_leaf (which does what it says: the minimum number of samples a leaf must have) and min_samples_split (the minimum number of samples a node must have before it can be split).
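As a starting point, here is a minimal sketch of how one of these parameters is passed in, using the same moons data and split as above. The value `min_samples_leaf=5` is an arbitrary choice for illustration, not a recommendation, and the variable names are made up so as not to clash with the notebook.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X_moons, y_moons = make_moons(n_samples=300, noise=0.25, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_moons, y_moons, test_size=0.2, random_state=20
)

# One tree with no restrictions, one where every leaf must keep at least 5 samples
full_tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
limited_tree = DecisionTreeClassifier(min_samples_leaf=5, random_state=0).fit(X_tr, y_tr)

for name, model in [("unrestricted", full_tree), ("min_samples_leaf=5", limited_tree)]:
    print(name,
          "leaves:", model.get_n_leaves(),
          "train:", model.score(X_tr, y_tr),
          "test:", model.score(X_te, y_te))
```

The gap between the training and testing accuracy is the thing to watch: the unrestricted tree fits the training points perfectly, which is exactly the overfitting seen in the plots.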

## Exercise 

Try using different ways of regularising these data and try to understand how well they generalise. Investigate all three ways.

## Regression

DTs can also be used for regression. Again we will look at a simple 2-D case.

In [None]:

#generate some data
import numpy as np
import numpy.random as rnd
import matplotlib.pyplot as plt

np.random.seed(42)

m = 1000
X = 6 * np.random.rand(m, 1) - 3
y = 0.5 * X**2 + X + 2 + np.random.randn(m, 1)

plt.plot(X, y, "b.")
plt.xlabel("$x_1$", fontsize=18)
plt.ylabel("$y$", rotation=0, fontsize=18)
plt.axis([-3, 3, 0, 10])

plt.show()

In [None]:
#Now perform some regression
reg=tree.DecisionTreeRegressor(max_depth=2)
reg.fit(X,y)

#now draw the results on top of the original
plt.plot(X, y, "b.")
plt.xlabel("$x_1$", fontsize=18)
plt.ylabel("$y$", rotation=0, fontsize=18)
plt.axis([-3, 3, 0, 10])
xd=np.linspace(-3,3,1000)
yd=reg.predict(xd.reshape(-1,1))
pl.plot(xd,yd,'r-')
pl.show()
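The red line is a staircase because a regression tree predicts one constant value per leaf (the mean of the training targets that end up there). A small sketch to check this, using freshly generated data of the same form as above (the names and the sample size of 200 are just for illustration):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Noisy quadratic of the same form as above, with a 1-D target array
rng = np.random.default_rng(42)
X_reg = 6 * rng.random((200, 1)) - 3
y_reg = 0.5 * X_reg[:, 0] ** 2 + X_reg[:, 0] + 2 + rng.normal(size=200)

demo_reg = DecisionTreeRegressor(max_depth=2).fit(X_reg, y_reg)

# Predict on a fine grid and count how many distinct values come out
grid = np.linspace(-3, 3, 500).reshape(-1, 1)
levels = np.unique(demo_reg.predict(grid))
print(len(levels), levels)
```

With max_depth=2 there can be at most four leaves, so at most four distinct predicted values, however fine the grid is.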


## Exercise
Now try changing max_depth over a range of (say) 1 to 300. Start off going up in single units and then make bigger jumps.

In [None]:
# and the tree itself
tree.plot_tree(reg,filled=True)

## The problem with decision trees ...

is that they are not very good as predictors. Otherwise they would be used everywhere. We will see this now by looking at the MNIST data again. 

In [None]:
from sklearn.datasets import fetch_openml
mnist = fetch_openml('mnist_784', version=1, as_frame=False)
print(mnist.keys())

In [None]:

X, y = mnist["data"], mnist["target"]


In [None]:
X_train, X_test, y_train, y_test = X[:60000], X[60000:], y[:60000], y[60000:]
print(X_train.shape,X_test.shape)

In [None]:
clf = tree.DecisionTreeClassifier(max_depth=20)
clf.fit(X_train,y_train)
#tree.plot_tree(clf,filled=True)

In [None]:
#print(X_test.shape)
pred=clf.predict(X_test)
#print(pred.shape)

from sklearn.metrics import accuracy_score
print(accuracy_score(y_test, pred))


## Exercise 

Use GridSearchCV to try to find the hyperparameters that best predict the MNIST data. Carry out the same validation as you did for the kNN algorithm, i.e. best_params_, best_score_ and a test against the testing set. Finally print out the predictions for the first 20 numbers and compare with the actual images (the code below displays the first 10 images; extend the range as needed).


In [None]:
# compare the first 10
import matplotlib as mpl
for i in range(10):
    pl.subplot(1,10,i+1)
    some_digit = X_test[i] # just to pick an arbitrary figure. Try a different one
    some_digit_image = some_digit.reshape(28, 28)
    plt.imshow(some_digit_image, cmap=mpl.cm.binary)
    plt.axis("off")




# Ensemble learning

Sometimes many models are better than one; combining them is called ensemble learning. You can have different types of model each returning an answer, with a voting scheme between them. sklearn has VotingClassifier for this. If a straight majority vote is taken then voting is said to be hard (voting='hard'); however, if the classifiers being combined all have a predict_proba() method then soft voting (voting='soft') can be used as well, where the predicted probabilities are also taken into account. Because of the lack of time we will not go through these in detail now, but you should know that they exist and that you can use them. There are plenty of examples online.

The area where ensemble learning is used most is with DTs.
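Purely for reference, a minimal sketch of what such a voting ensemble might look like on the moons data. The choice of the three member models and their settings is arbitrary and only for illustration; all three provide predict_proba(), so both voting modes work.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X_v, y_v = make_moons(n_samples=300, noise=0.25, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_v, y_v, test_size=0.2, random_state=20)

# Three different kinds of model (an arbitrary selection for this sketch)
members = [
    ("lr", LogisticRegression()),
    ("knn", KNeighborsClassifier()),
    ("dt", DecisionTreeClassifier(max_depth=3, random_state=0)),
]

vote_scores = {}
for voting in ("hard", "soft"):
    # VotingClassifier fits its own copies of the member models
    ensemble = VotingClassifier(estimators=members, voting=voting)
    ensemble.fit(X_tr, y_tr)
    vote_scores[voting] = ensemble.score(X_te, y_te)
print(vote_scores)
```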


## Bagging and pasting

Another way of generating lots of estimators is to use the same classifier many times, each time training it on a randomly chosen sample from your training set. If these samples are taken with replacement this is called bagging (apparently short for *bootstrap aggregation*); if the samples are taken without replacement then this is called pasting. *With replacement* means that a sample is taken, its features recorded and then it is "thrown back into the bag", meaning that it can be selected again. If it isn't thrown back then this is without replacement, i.e. pasting. Now let's look again at our moon distributions.
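Before returning to the moons, the difference between the two sampling schemes can be seen by drawing row indices with NumPy. This is only a toy illustration with made-up sizes (10 training rows, 6 draws); in sklearn's BaggingClassifier the switch is the `bootstrap` argument.

```python
import numpy as np

rng = np.random.default_rng(0)
n_train, n_draw = 10, 6          # toy sizes, chosen only for illustration
indices = np.arange(n_train)

# Bagging: each drawn index is "thrown back", so it may appear more than once
bagging_sample = rng.choice(indices, size=n_draw, replace=True)

# Pasting: a drawn index is removed from the bag, so every index is distinct
pasting_sample = rng.choice(indices, size=n_draw, replace=False)

print("bagging:", bagging_sample)
print("pasting:", pasting_sample)
```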

In [None]:
from sklearn.datasets import make_moons
X, y = make_moons(n_samples=300, noise=0.25, random_state=42)

def plot_dataset(X, y, axes):
    plt.plot(X[:, 0][y==0], X[:, 1][y==0], "bs")
    plt.plot(X[:, 0][y==1], X[:, 1][y==1], "g^")
    plt.axis(axes)
    plt.grid(True, which='both')
    plt.xlabel(r"$x_1$", fontsize=20)
    plt.ylabel(r"$x_2$", fontsize=20, rotation=0)

plot_dataset(X, y, [-1.8, 2.7, -1.5, 1.8])
plt.show()

If we use a DT without any max_depth set we get:

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.2,random_state=20) 
print(len(y_train))
clf = tree.DecisionTreeClassifier()
clf.fit(X_train,y_train)
pl.rcParams['figure.figsize'] = [10, 5] #more reasonable plots
plot_decision_boundary(clf,X_train,y_train, axes=[-1.8, 2.7, -1.5, 1.8])

Whereas if we use a bagging algorithm we get:

In [None]:
# now if we try bagging 
from sklearn.ensemble import BaggingClassifier

bclf=BaggingClassifier(
    tree.DecisionTreeClassifier(),n_estimators=500,max_samples=100, bootstrap=True,n_jobs=4,oob_score=True)
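# n_jobs is the number of CPU cores used for both training and predicting.
# n_estimators is how many classifiers it will train.
# max_samples is the number of samples drawn from the training set for each classifier.
# bootstrap=True means sampling with replacement, i.e. bagging.
# oob_score=True requests the "out-of-bag" score, evaluated on the samples each classifier did not see.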
bclf.fit(X_train,y_train)
plot_decision_boundary(bclf,X_train,y_train, axes=[-1.8, 2.7, -1.5, 1.8])

pred=bclf.predict(X_test)
print(bclf.oob_score_)
from sklearn.metrics import accuracy_score
print(accuracy_score(y_test, pred))

## Exercise 

Try playing around with the max depth to see what difference it makes. Also try different numbers of estimators and samples. The documentation can be found [here](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.BaggingClassifier.html).

As not all of your sample is used to train each estimator, you can use the remaining data to test it. This is called out-of-bag (oob) evaluation and you can request it with oob_score=True.
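To get a feel for how much data is left over, the sketch below simulates the draw used above: 100 samples taken with replacement from the 240 training points. It compares the simulated fraction of unused points with the simple estimate $(1-1/n)^m$ for $m$ draws from $n$ points. The number of trials is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(1)
n_train, max_samples, n_trials = 240, 100, 2000

unused = []
for _ in range(n_trials):
    drawn = rng.integers(0, n_train, size=max_samples)    # one bootstrap draw
    unused.append(1 - len(np.unique(drawn)) / n_train)    # fraction never picked

simulated = float(np.mean(unused))
expected = (1 - 1 / n_train) ** max_samples   # chance a given point is never picked
print(simulated, expected)
```

So with these settings roughly two thirds of the training points are out of the bag for any one estimator, which is plenty to test on.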

## Exercise

Bagging can also be used for regression. Try this on the data set that we looked at earlier for regression.

## Random forests

Random forests are ensembles of DTs, usually built with bagging (possibly with pasting), where max_samples is around the size of the training sample. Clearly you can build these using the BaggingClassifier; however, sklearn has them built in. See the code below.

In [None]:
from sklearn.ensemble import RandomForestClassifier

rfclf=RandomForestClassifier(n_estimators=500,max_leaf_nodes=20,n_jobs=4,oob_score=True)

rfclf.fit(X_train,y_train)
plot_decision_boundary(rfclf,X_train,y_train, axes=[-1.8, 2.7, -1.5, 1.8])

pred=rfclf.predict(X_test)
print(rfclf.oob_score_)
from sklearn.metrics import accuracy_score
print(accuracy_score(y_test, pred))
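For comparison, here is a sketch of roughly the same thing built by hand with BaggingClassifier. One assumption to flag: a random forest also looks at only a random subset of the features at each split, which is mimicked here with `max_features="sqrt"` on the tree. The two are not guaranteed to give identical results, and the number of estimators is reduced to 100 just to keep it quick.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X_f, y_f = make_moons(n_samples=300, noise=0.25, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_f, y_f, test_size=0.2, random_state=20)

# Hand-built "forest": bagged trees, each split using a random subset of features.
# max_samples is left at its default, i.e. samples the size of the training set.
bagged_trees = BaggingClassifier(
    DecisionTreeClassifier(max_features="sqrt", max_leaf_nodes=20),
    n_estimators=100, bootstrap=True, random_state=0,
)
# The built-in equivalent
forest = RandomForestClassifier(n_estimators=100, max_leaf_nodes=20, random_state=0)

bagged_score = bagged_trees.fit(X_tr, y_tr).score(X_te, y_te)
forest_score = forest.fit(X_tr, y_tr).score(X_te, y_te)
print(bagged_score, forest_score)
```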

## Feature importance

It is often useful to look at the importance of different features in your data. For example, look at the case below for the MNIST training set.


In [None]:
X_m, y_m = mnist["data"], mnist["target"]
X_train_m, X_test_m, y_train_m, y_test_m = X_m[:60000], X_m[60000:], y_m[:60000], y_m[60000:]

rfclf=RandomForestClassifier(n_estimators=500,max_leaf_nodes=20,n_jobs=4,oob_score=True)
rfclf.fit(X_train_m,y_train_m)



In [None]:
pred=rfclf.predict(X_test_m)

from sklearn.metrics import accuracy_score
print(accuracy_score(y_test_m, pred))


In [None]:
import matplotlib as mpl
def plot_digit(data):
    image = data.reshape(28, 28)
    plt.imshow(image, cmap = mpl.cm.hot,
               interpolation="nearest")
    plt.axis("off")

In [None]:
plot_digit(rfclf.feature_importances_)

cbar = plt.colorbar(ticks=[rfclf.feature_importances_.min(), rfclf.feature_importances_.max()])
cbar.ax.set_yticklabels(['Not important', 'Very important'])

plt.show()

While they would not mean much for MNIST, the numerical values can be important too, so let us consider our iris dataset again.

In [None]:
X_train_i, X_test_i, y_train_i, y_test_i = train_test_split(iris['data'],iris['target'], 
                                                            test_size=0.2,random_state=20) 
rfclf=RandomForestClassifier(n_estimators=500,n_jobs=4)
rfclf.fit(X_train_i,y_train_i)
display(rfclf.feature_importances_)

# which you can then plot 
import pandas as pd
imps=pd.Series(rfclf.feature_importances_, index=iris.feature_names)
imps.plot.bar()

# There are a number of different ways to plot these and this is just one example.


## Boosting

Another, often more powerful, way of combining DTs is through boosting. Here you choose your ensembles in non-random ways, generally combining weak learners (usually short DTs) to produce strong learners. This is often done by comparing to the previous predictions and updating appropriately. There are lots of different boosting algorithms. The most common in my world is *gradient boosting* (with the models being called Boosted Decision Trees or BDTs) -- these are very powerful tools and we use them a lot. In other parts of machine learning AdaBoost is also quite common.

Both gradient boosting and AdaBoost are too long to explain here. But I have found some online videos that explain them pretty well if you want to look at them in your own time (yes, I hate the singing as well). For AdaBoost I would recommend [this](https://www.youtube.com/watch?v=LsK-xG1cLYA&t=1054s), and for gradient boosting [this](https://www.youtube.com/watch?v=3CC4N4z3GJc) and, shorter, [this](https://www.youtube.com/watch?v=0Xc9LIb_HTw).

For the sake of time we will only look at gradient boosting.
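Without going into the full story, the core idea can be sketched in a few lines for the simplest case: regression with a squared-error loss, where each new short tree is fitted to whatever the ensemble so far still gets wrong (the residuals). This is a hand-rolled illustration under those assumptions, not what XGBoost or sklearn actually do internally, and the data, learning rate and number of stages are arbitrary.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Noisy quadratic, same form as in the regression section
rng = np.random.default_rng(42)
X_b = 6 * rng.random((200, 1)) - 3
y_b = 0.5 * X_b[:, 0] ** 2 + X_b[:, 0] + 2 + rng.normal(size=200)

learning_rate = 1.0                 # how much of each new tree to add
prediction = np.zeros_like(y_b)     # the ensemble starts by predicting nothing
errors = []
for stage in range(3):
    residual = y_b - prediction     # what the ensemble still gets wrong
    weak = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X_b, residual)
    prediction += learning_rate * weak.predict(X_b)
    errors.append(float(np.mean((y_b - prediction) ** 2)))

print(errors)   # training error after each stage
```

Each stage is a weak learner on its own, but the training error drops as they are added together.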

Although sklearn does have gradient boosting, the best "on the market" is [XGBoost](https://xgboost.readthedocs.io/en/latest/index.html), which is a very powerful and very fast tool. I will only give you one very simple example, just to show you that it works just like sklearn -- well, at least the syntax is the same -- under the hood ...  I feel guilty that I am not showing you more, as there are so many different parameters that you can tune, but you could easily spend all three hours on XGBoost.

The calling parameters are:

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1,
              importance_type='gain', interaction_constraints='',
              learning_rate=0.300000012, max_delta_step=0, max_depth=6,
              min_child_weight=1, missing=nan, monotone_constraints='()',
              n_estimators=100, n_jobs=4, num_parallel_tree=1, random_state=0,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, subsample=1,
              tree_method='exact', validate_parameters=1, verbosity=None)


I suggest that you look up different examples online and try to use them when looking at the exercise below.





In [None]:
from xgboost import XGBClassifier
xgbclf=XGBClassifier(use_label_encoder=False,eval_metric='error') # these arguments just stop a warning that doesn't apply to us anyway - try removing them
xgbclf.fit(X_train,y_train)

pred=xgbclf.predict(X_test)

from sklearn.metrics import accuracy_score
print(accuracy_score(y_test, pred))

plot_decision_boundary(xgbclf,X_train,y_train, axes=[-1.8, 2.7, -1.5, 1.8])
pl.show()

## Exercise

Use random forests and BDTs to look again at the MNIST classification problem. Look at different parameters - possibly even do a grid scan to see what the best ones are. How does the time compare?

OK **-- honesty alert --** when I came to do this I had problems with a simple approach in the sklearn framework, so last night I reverted to an old-fashioned XGB approach to setting up the problem, partially nicked from Kaggle -- see below (it was late etc. etc.). However, it should be possible with the simple sklearn-style approach given the right options, and I would be impressed to see somebody's solution along those lines.

In [None]:
import xgboost as xgb

param_list = [("eta", 0.08), ("max_depth", 6), ("subsample", 0.8),
              ("colsample_bytree", 0.8), ("objective", "multi:softmax"), 
              ("eval_metric", "merror"), ("alpha", 8), ("lambda", 2), ("num_class", 10)]
n_rounds =  600
early_stopping = 50
    
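# fetch_openml returns the labels as strings, so cast them to integers for XGBoost
d_train = xgb.DMatrix(X_train_m, label=y_train_m.astype(int))
# note: the test set doubles as the validation set for early stopping here
d_val = xgb.DMatrix(X_test_m, label=y_test_m.astype(int))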
eval_list = [(d_train, "train"), (d_val, "validation")]
bst = xgb.train(param_list, d_train, n_rounds, evals=eval_list, early_stopping_rounds=early_stopping, verbose_eval=True)

In [None]:
d_test = xgb.DMatrix(data=X_test_m)
pred=bst.predict(d_test)

In [None]:
display(pred[0:30])
display(y_test_m[0:30])
yf=[float(i) for i in y_test_m]

print(accuracy_score(yf, pred))


In [None]:
t=0
i=0
while t == 0:
    if yf[i] != pred[i] and i != 63 and i!= 124 and i != 241 and i != 247 and i!=259: #just picking a few out
        t=1
        some_digit = X_test_m[i] # just to pick an arbitrary figure. Try a different one
        some_digit_image = some_digit.reshape(28, 28)
        plt.imshow(some_digit_image, cmap=mpl.cm.binary)
        plt.axis("off")
        print(yf[i],pred[i],i)
        pl.show()
    i=i+1