In [1]:
import numpy as np
import pandas
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn import tree
%matplotlib notebook

# Decision tree (categorical)

Let's try to construct a decision tree for the weather data. If we stick to `scikit`, unfortunately, we already hit a big limitation of its decision tree implementation:

* `scikit` decision trees are only for **numeric** data!!

Our weather data, however, is categorical, so we need the attribute encoding that was discussed earlier.

We **cannot** simply replace strings with numbers (e.g., "Sunny" = 1, "Rainy" = 2, etc.), since `scikit` treats these values as ordered numbers, but our categories have **no** ordering. If our data were **ordinal**, we could do this, since the ordering would be meaningful (e.g., "Worst" = -2, "Neutral" = 0, "Best" = 2).

So, we have to encode our values with a **one-hot encoding**. There are a variety of approaches for this. Let's use `pandas` for now.
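
To make the encoding concrete, here is a minimal sketch on a tiny, made-up `Outlook` column (the three rows are an assumption for illustration, not the actual weather file). `get_dummies` creates one indicator column per category, and each row has exactly one of them set.

```python
import pandas

# hypothetical toy column, not the real weather data
toy = pandas.DataFrame({'Outlook': ['Sunny', 'Rainy', 'Overcast']})

# one indicator column per category, named <column>_<value>
encoded = pandas.get_dummies(toy).astype(int)

print(encoded.columns.tolist())
print(encoded.sum(axis=1).tolist())  # number of indicators set in each row
```

The `astype(int)` call is only there to get 0/1 values regardless of the `pandas` version, since the default dtype of the dummy columns differs between versions.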

In [2]:
# read in our weather data
rawData = pandas.read_csv('data/dataWeather.txt',delimiter='\t')

# the labels are in the last column
label = rawData['Play']
# the actual "data" is in all other columns
tmp = rawData.iloc[:,0:-1]

# now convert the data to one-hot encoding!
data = pandas.get_dummies(tmp)

# construct a decision tree model that uses entropy (information gain) to choose splits
dt = tree.DecisionTreeClassifier(criterion='entropy')

# fit it to our data
dt.fit(data,label)

DecisionTreeClassifier(class_weight=None, criterion='entropy', max_depth=None,
            max_features=None, max_leaf_nodes=None, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='best')

Now we have constructed a tree and fit it to our weather data. 
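
The `criterion='entropy'` setting scores each candidate split by its information gain. The following is a rough sketch of that computation, not `scikit`'s internal code. The class counts are an assumption taken from the common textbook version of the weather data (9 "Yes" and 5 "No" days) and may differ from the file loaded above. The column name `Outlook_Overcast` is likewise only illustrative.

```python
from math import log2

def entropy(counts):
    """Entropy (in bits) of a node with the given class counts."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the children."""
    total = sum(parent)
    remainder = sum(sum(child) / total * entropy(child) for child in children)
    return entropy(parent) - remainder

# ASSUMPTION: textbook weather data with 9 "Yes" and 5 "No" days
root = [9, 5]
# binary split on a one-hot column such as Outlook_Overcast:
# overcast days (4 Yes, 0 No) versus all other days (5 Yes, 5 No)
gain_binary = information_gain(root, [[4, 0], [5, 5]])
# three-way split on Outlook: Sunny (2, 3), Overcast (4, 0), Rainy (3, 2)
gain_multiway = information_gain(root, [[2, 3], [4, 0], [3, 2]])

print(round(entropy(root), 3), round(gain_binary, 3), round(gain_multiway, 3))
```

Because the one-hot columns only allow binary splits, the gains that `scikit` compares are of the first kind, not the three-way kind.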

We would now like to test that tree with a new day (e.g., "Sunny", "Cool", "High", and "True") to see whether we should play or not.

Unfortunately, in order to test this, we also need to convert that day into a one-hot encoded input vector. But how can we do this? 

Fortunately, there is a neat answer: we one-hot encode the new day separately and then **reindex** the resulting dummy variables with the columns of the original encoding, so that both have exactly the same columns in the same order.

In [3]:
# test with a new day

# first do the one-hot encoding for the new day
tmp = pandas.get_dummies(pandas.DataFrame({'Outlook':['Sunny'],'Temperature':['Cool'],'Humidity':['High'],'Windy':[True]}))

# now re-index this with the original encoding, making sure to 
# add "0" to every column that does NOT appear in our tmp variable!
newDay=tmp.reindex(columns = data.columns, fill_value=0)

# finally, we can predict the decision:
print('on the tested day, the decision to Play is:',dt.predict(newDay)[0],'with',np.max(dt.predict_proba(newDay)),'probability')

on the tested day, the decision to Play is: No with 1.0 probability


Now let's visualize the tree using `export_graphviz` and `pydotplus`. We also select a few options to produce nicer, more readable output.

In [4]:
import pydotplus 
from IPython.display import Image
# export the tree using the correct feature names
# colored by purity
dot_data = tree.export_graphviz(dt,out_file=None,
                                feature_names=data.columns,
                                rounded=True,filled=True)
# convert this to a picture structure
graph = pydotplus.graph_from_dot_data(dot_data)  
# show this picture as a PNG in the browser
Image(graph.create_png())

AttributeError: 'NoneType' object has no attribute 'write'

In the rendered tree, the Outlook attribute is selected first, followed by Humidity. Since this is a binary tree, however, the splits look different from the ones in the PowerPoint material.

In general, the `scikit` implementation leaves a bit to be desired (for example, it cannot handle categorical attributes directly). If you want more control over your decision trees, you can use a more specialized package, such as:

`Python Decision Tree 3.4.3` at https://engineering.purdue.edu/kak/distDT/DecisionTree-3.4.3.html
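
As a text-only alternative to the Graphviz export, newer `scikit` versions offer `tree.export_text` (this assumes version 0.21 or later, which may not match the installation used in this notebook). A minimal sketch on made-up one-feature data:

```python
from sklearn import tree

# hypothetical one-feature toy data: humid days are "No", other days are "Yes"
X = [[1], [1], [0], [0]]
y = ['No', 'No', 'Yes', 'Yes']

dt_toy = tree.DecisionTreeClassifier(criterion='entropy').fit(X, y)

# plain-text rendering of the fitted tree, no Graphviz needed
report = tree.export_text(dt_toy, feature_names=['Humidity_High'])
print(report)
```

For the weather tree, the same call would presumably take `list(data.columns)` as `feature_names`.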

## Training error

Finally, let's take a look at the training error. By default, the tree keeps splitting until all leaves are "pure", so it produces 0 training error (as long as no two identical samples carry different labels). As we will see, especially for big problems, this may lead to "overfitting". One way to avoid this is to limit the depth of the tree.

We therefore also construct a tree of depth 2 and look at its error.

In [5]:
print('got',np.size(np.where(dt.predict(data)!=label)),'training errors for the full tree')

# construct a depth-limited decision tree model that uses entropy (information gain) to choose splits
dtSmall = tree.DecisionTreeClassifier(criterion='entropy',max_depth=2)

# fit it to our data
dtSmall.fit(data,label)

print('got',np.size(np.where(dtSmall.predict(data)!=label)),'training errors for the small tree')

got 0 training errors for the full tree
got 2 training errors for the small tree


# Decision tree for numeric data

Let's use `scikit` to train a tree on numeric data. We start with a very simple two-point example: a point at (0,0) that belongs to "classA" and another point at (1,1) that belongs to "classB":

In [6]:
# make simple data
data = [[0,0],[1,1]]
# give the points a label
label = ['classA','classB']

# construct a decision tree model that uses entropy (information gain) to choose splits
dt = tree.DecisionTreeClassifier(criterion='entropy')

# fit it to our data
dt.fit(data,label)

DecisionTreeClassifier(class_weight=None, criterion='entropy', max_depth=None,
            max_features=None, max_leaf_nodes=None, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='best')

Now that we've fit the tree, let's test it with two nearby points:

In [7]:
test0 = dt.predict([[1.3,1.3]])
test1 = dt.predict([[0.1,0.2]])

print('predicted {:s} for first point and {:s} for second'.format(test0[0],test1[0]))

predicted classB for first point and classA for second


Let's try to visualize the decision space of the tree in x,y coordinates. For this, we first construct a `meshgrid` of points:

In [8]:
(x,y)=(np.linspace(-0.5,1.5,21),np.linspace(-0.5,1.5,21))
(xv,yv)=np.meshgrid(x,y)

Unfortunately, while this format is very good for function evaluation and plotting, the `predict` method of the decision tree in `scikit` cannot use it directly: it expects one row per point.

So, we need to combine the coordinates into (x,y) pairs. The `ravel` method from `numpy` flattens each grid into a 1D array. Stacking the two flattened arrays and transposing the result yields an array with one (x,y) pair per row, which is fed into the tree predictor:

In [9]:
test = dt.predict(np.array([xv.ravel(), yv.ravel()]).T)
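
To see what the flatten-and-transpose step produces, here is a small sketch on a deliberately tiny grid (the names `xs`, `ys`, `xg`, `yg` are made up for this example). It also shows `np.column_stack` as an equivalent way to build the same array.

```python
import numpy as np

# small 3 x 2 grid so the result is easy to inspect
(xs, ys) = (np.linspace(0, 1, 3), np.linspace(0, 1, 2))
(xg, yg) = np.meshgrid(xs, ys)      # both grids have shape (2, 3)

# flatten both grids and pair them up: one (x, y) point per row
points = np.array([xg.ravel(), yg.ravel()]).T

# np.column_stack builds the same array in one call
same = np.column_stack([xg.ravel(), yg.ravel()])

print(xg.shape, points.shape)
```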

We will now draw the decision space by substituting "0" for "classA" and "1" for "classB". The result needs to be reshaped into the original `meshgrid` format, and then we can plot the contours of the different classification outputs as follows:

In [10]:
# map the string labels to numbers: 0 for "classA", 1 for "classB"
test = (test=='classB').astype(int)
test = test.reshape(xv.shape)
fig=plt.figure()
plt.contourf(xv, yv, test)

<matplotlib.contour.QuadContourSet at 0xb2a5940>

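As we can see, the plane is split into two sections. Whether this split is horizontal or vertical is random: both features separate the two points equally well, and `scikit` breaks the tie randomly unless `random_state` is fixed.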
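
If a reproducible split direction is wanted, one option is to pass a fixed `random_state`. The sketch below refits the same two-point problem twice with the same seed and reads the root feature from the fitted `tree_` structure. The helper `root_feature` and the names `data2` and `label2` are made up for this example.

```python
from sklearn import tree

data2 = [[0, 0], [1, 1]]
label2 = ['classA', 'classB']

def root_feature(seed):
    """Index of the feature tested at the root (0 = X, 1 = Y)."""
    model = tree.DecisionTreeClassifier(criterion='entropy', random_state=seed)
    model.fit(data2, label2)
    return int(model.tree_.feature[0])

# the same seed gives the same split direction on every run
first = root_feature(0)
second = root_feature(0)
print(first, second)
```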

Let's take a look at the tree:

In [11]:
import pydotplus 
from IPython.display import Image
# export the tree colored by purity
dot_data = tree.export_graphviz(dt,out_file=None,
                                feature_names=['X','Y'],
                                rounded=True,filled=True)
# convert this to a picture structure
graph = pydotplus.graph_from_dot_data(dot_data)  
# show this picture as a PNG in the browser
Image(graph.create_png())

AttributeError: 'NoneType' object has no attribute 'write'

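The export fails here with the same `AttributeError` as in the weather example, so no picture is produced. A likely cause (an assumption, since it depends on the installed versions) is an older `scikit` release whose `export_graphviz` does not accept `out_file=None`.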


## XOR-tree

Let's try to see what the famous XOR problem does to the tree: 

In [None]:
# make simple data
data = [[0,0],[1,1],[0,1],[1,0]]
# give the points a label
label = ['classA','classA','classB','classB']

# construct a decision tree model that uses entropy (information gain) to choose splits
dt = tree.DecisionTreeClassifier(criterion='entropy')

# fit it to our data
dt.fit(data,label)

# test it with a range of coordinates
test = dt.predict(np.array([xv.ravel(), yv.ravel()]).T)
# map the string labels to numbers: 0 for "classA", 1 for "classB"
test = (test=='classB').astype(int)
test = test.reshape(xv.shape)
fig=plt.figure()
plt.contourf(xv, yv, test)

# export the tree colored by purity
dot_data = tree.export_graphviz(dt,out_file=None,
                                feature_names=['X','Y'],
                                rounded=True,filled=True)
# convert this to a picture structure
graph = pydotplus.graph_from_dot_data(dot_data)  
# show this picture as a PNG in the browser
Image(graph.create_png())

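As we can see, the tree classifies all four points correctly. It can only do this by having a replicated sub-tree: whichever feature is tested first, both branches must then test the other feature.

In general, such trees can carve arbitrary axis-aligned boxes out of the feature space.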
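
A quick numeric check of the XOR result, independent of the plot: the sketch below refits the four points under separate, made-up names (`xor_points`, `xor_labels`, `xor_tree`) and counts the training errors and the depth of the fitted tree.

```python
import numpy as np
from sklearn import tree

xor_points = [[0, 0], [1, 1], [0, 1], [1, 0]]
xor_labels = ['classA', 'classA', 'classB', 'classB']

xor_tree = tree.DecisionTreeClassifier(criterion='entropy').fit(xor_points, xor_labels)

# number of misclassified training points
n_errors = int(np.sum(xor_tree.predict(xor_points) != np.array(xor_labels)))
# depth of the fitted tree (number of tests on the longest root-to-leaf path)
depth = int(xor_tree.tree_.max_depth)
print(n_errors, depth)
```

No single test on X or Y separates the classes, so the tree needs two levels: one test on each feature along every path.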

## DIY: Decision tree on the IRIS data

Let's try the IRIS data with our decision tree. This data is purely numerical, so the `scikit` implementation has no problem with it. We will first load the data and then select one subset for training and another for testing:

In [None]:
# load iris data
iris = 

# we know that the data has 150 flowers, 50 flowers for each
# of the three categories. I would like to get a subset of 
# each flower for training, and the remainder for testing

# select the first 40 flowers of each category for training
trainIdx = 
# select the remaining 10 flowers of each category for testing
testIdx = 

# construct the default tree
dtIris = 

# fit it to our training data
dtIris =
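
One possible way to fill in the blanks is sketched below. It is not the only solution, and it relies on the assumption stated in the comments above: the 150 rows come in three consecutive blocks of 50 flowers, one block per category.

```python
import numpy as np
from sklearn import tree
from sklearn.datasets import load_iris

# load iris data
iris = load_iris()

# start row of each category block (assumes blocks of 50 rows per category)
offsets = np.arange(0, 150, 50)

# first 40 flowers of each category for training
trainIdx = np.concatenate([np.arange(o, o + 40) for o in offsets])
# remaining 10 flowers of each category for testing
testIdx = np.concatenate([np.arange(o + 40, o + 50) for o in offsets])

# construct the default tree
dtIris = tree.DecisionTreeClassifier()

# fit it to our training data (fit returns the fitted estimator itself)
dtIris = dtIris.fit(iris.data[trainIdx], iris.target[trainIdx])
```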

Good, now let's predict the results on the test data and evaluate how good we are. 

Let's also visualize the tree (see above):

In [None]:
test = dtIris.predict(iris.data[testIdx])
print('got',np.size(np.where(test!=iris.target[testIdx])),'test errors for the tree')

# export the tree colored by purity
dot_data = tree.export_graphviz(dtIris,out_file=None,
                                feature_names=iris.feature_names,
                                class_names=iris.target_names,
                                rounded=True,filled=True)
# convert this to a picture structure
graph = pydotplus.graph_from_dot_data(dot_data)  
# show this picture as a PNG in the browser
Image(graph.create_png())

The beauty of this is that we can actually see that one full class ("setosa") can be predicted by just asking one simple question on the "petal length". Separating the "virginica" from the "versicolor" is a little bit more involved.

Note that we've trained on only a subset of the data and tested on another, independent test set. We will talk about this a lot more soon!