# Tutorial: decisiontree.py
First, we import the required packages:

In [1]:
import numpy as np
import decisiontree as dt

The dataset records whether someone will play tennis given four weather factors: outlook, temperature, humidity, and wind. All four are categorical variables, and the label is binary. The cell below creates the dataset and the dictionary of possible values each feature can take. Both are required to fit a decision tree. Because the array mixes strings and integers, NumPy stores every entry as a string, so the label column is converted back to integers with astype(int).

In [2]:
data = np.array([['S', 'H', 'H', 'W', 0],
                ['S', 'H', 'H', 'S', 0],
                ['O', 'H', 'H', 'W', 1],
                ['R', 'M', 'H', 'W', 1],
                ['R', 'C', 'N', 'W', 1],
                ['R', 'C', 'N', 'S', 0],
                ['O', 'C', 'N', 'S', 1],
                ['S', 'M', 'H', 'W', 0],
                ['S', 'C', 'N', 'W', 1],
                ['R', 'M', 'N', 'W', 1],
                ['S', 'M', 'N', 'S', 1],
                ['O', 'M', 'H', 'S', 1],
                ['O', 'H', 'N', 'W', 1],
                ['R', 'M', 'H', 'S', 0]])

Attrs = {
    "Outlook"       : ["S", "O", "R"],
    "Temperature"   : ["H", "M", "C"],
    "Humidity"      : ["H", "N", "L"],
    "Wind"          : ["S", "W"]
}

X = data[:,:-1]
y = data[:,-1].astype(int)
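Since fit needs both the data and the attribute dictionary, it can be worth checking that the two agree before training. The following is a minimal sketch in plain Python, not part of decisiontree.py: it re-enters the feature rows as four-character strings and assumes the dictionary keys are listed in column order.

```python
# Plain-Python copy of the feature rows above; each character is one feature
# value in the order Outlook, Temperature, Humidity, Wind.
rows = [
    "SHHW", "SHHS", "OHHW", "RMHW", "RCNW", "RCNS", "OCNS",
    "SMHW", "SCNW", "RMNW", "SMNS", "OMHS", "OHNW", "RMHS",
]
attrs = {
    "Outlook": ["S", "O", "R"],
    "Temperature": ["H", "M", "C"],
    "Humidity": ["H", "N", "L"],
    "Wind": ["S", "W"],
}

# Assumption: dictionary order matches column order (dicts keep insertion order).
unknown = []  # (row index, attribute, value) for values missing from the dictionary
unused = {}   # attribute -> declared values that never occur in the data
for col, (name, allowed) in enumerate(attrs.items()):
    seen = {row[col] for row in rows}
    unknown += [(i, name, row[col]) for i, row in enumerate(rows)
                if row[col] not in allowed]
    unused[name] = sorted(set(allowed) - seen)

print("unknown values:", unknown)
print("declared but unused:", {k: v for k, v in unused.items() if v})
```

On this dataset no value is missing from the dictionary, and the only declared value that never occurs is "L" for Humidity.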

Once we have the input data, label vector, and attribute dictionary, we can define the model and train a tree. Here, I define a decision tree that uses the Gini index to choose the best split and limit the depth to 1, which yields a decision stump. Calling the fit function on the model constructs the tree.

In [7]:
model = dt.DecisionTree(split_metric="gini", max_depth=1)
model.fit(X, y, Attrs)
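To make the split metric concrete, here is a sketch of the textbook weighted Gini impurity for each attribute on this dataset. It is written in plain Python and is not the code inside decisiontree.py; the helper names gini and weighted_gini are made up for this illustration.

```python
from collections import Counter

# Feature rows (Outlook, Temperature, Humidity, Wind) and labels from the dataset.
rows = [
    "SHHW", "SHHS", "OHHW", "RMHW", "RCNW", "RCNS", "OCNS",
    "SMHW", "SCNW", "RMNW", "SMNS", "OMHS", "OHNW", "RMHS",
]
labels = [0, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0]
names = ["Outlook", "Temperature", "Humidity", "Wind"]

def gini(ys):
    """Gini impurity 1 - sum_k p_k^2 of a list of class labels."""
    n = len(ys)
    return 1.0 - sum((c / n) ** 2 for c in Counter(ys).values())

def weighted_gini(col):
    """Size-weighted average impurity of the groups formed by one column."""
    groups = {}
    for row, y in zip(rows, labels):
        groups.setdefault(row[col], []).append(y)
    return sum(len(g) / len(labels) * gini(g) for g in groups.values())

# Lower is better: a score of 0 would mean every group is pure.
scores = {name: weighted_gini(i) for i, name in enumerate(names)}
for name, score in sorted(scores.items(), key=lambda kv: kv[1]):
    print(f"{name:12s} {score:.4f}")
```

Under this definition the scores are Outlook 0.3429, Humidity 0.3673, Wind 0.4286, and Temperature 0.4405, so Outlook gives the purest split. The run recorded in this tutorial places Temperature at the root, so decisiontree.py may score or select splits differently from this sketch.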

After the tree is constructed, there are a few things we can do. I have not implemented a direct visualization function, so we first use the display_links() and display_tree() functions to draw the tree by hand. display_links() prints the general structure of nodes and links:

In [8]:
model.display_links()

1 -> 2
Node 2 is a leaf
1 -> 3
Node 3 is a leaf
1 -> 4
Node 4 is a leaf


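Then, display_tree() prints the attribute, rule, and label of each node, which we use to fill in the nodes and links. In the output below, the rule of a child node is the value of its parent's attribute on the link leading to it: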

In [9]:
model.display_tree()

Node Details:
ID: 1
Depth: 0
Attribute: Temperature
Rule: None
label: None

Node Details:
ID: 2
Depth: 1
Attribute: None
Rule: H
label: 0

Node Details:
ID: 3
Depth: 1
Attribute: None
Rule: M
label: 1

Node Details:
ID: 4
Depth: 1
Attribute: None
Rule: C
label: 1
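When drawing by hand is inconvenient, the two outputs can also be combined into a quick text rendering. The sketch below transcribes the printed links and node details into dictionaries by hand; the links and nodes structures and the render helper are illustrative only and do not reflect how decisiontree.py stores its tree.

```python
# Transcribed by hand from the display_links() and display_tree() output above.
links = {1: [2, 3, 4]}
nodes = {
    1: {"attribute": "Temperature", "rule": None, "label": None},
    2: {"attribute": None, "rule": "H", "label": 0},
    3: {"attribute": None, "rule": "M", "label": 1},
    4: {"attribute": None, "rule": "C", "label": 1},
}

def render(node_id, indent=0):
    """Return the lines of an indented text drawing of the subtree at node_id."""
    node = nodes[node_id]
    prefix = "  " * indent
    # The rule labels the incoming link; the root has none.
    edge = "" if node["rule"] is None else f"[{node['rule']}] "
    if node["label"] is not None:
        what = f"predict {node['label']}"
    else:
        what = f"split on {node['attribute']}"
    lines = [f"{prefix}{edge}{what}"]
    for child in links.get(node_id, []):
        lines += render(child, indent + 1)
    return lines

tree_lines = render(1)
print("\n".join(tree_lines))
```

This prints the root split on Temperature followed by one indented line per leaf, for example "[H] predict 0".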



Putting the links and node details together gives the constructed tree:


<img src="tree.png">

We can use this tree to make predictions. Here, we compute predictions on the same dataset the tree was trained on and calculate the training error rate.

In [11]:
preds = model.predict(X, Attrs)

train_error = 1 - np.mean(preds == y)

print(f"Decision Stump Training Error: {round(100 * (train_error), 2)}%")

Decision Stump Training Error: 35.71%
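The reported figure can be checked by hand from the display_tree() output. The sketch below applies the stump's three rules (Temperature H predicts 0, M predicts 1, C predicts 1) to a plain-Python copy of the dataset; it does not call decisiontree.py.

```python
# Feature rows (Outlook, Temperature, Humidity, Wind) and labels from the dataset.
rows = [
    "SHHW", "SHHS", "OHHW", "RMHW", "RCNW", "RCNS", "OCNS",
    "SMHW", "SCNW", "RMNW", "SMNS", "OMHS", "OHNW", "RMHS",
]
labels = [0, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0]

# Temperature value -> predicted label, read off the display_tree() output.
stump = {"H": 0, "M": 1, "C": 1}
TEMPERATURE = 1  # position of Temperature within each row

preds = [stump[row[TEMPERATURE]] for row in rows]
errors = sum(p != y for p, y in zip(preds, labels))
train_error = errors / len(labels)
print(f"{errors} of {len(labels)} wrong -> {100 * train_error:.2f}%")
```

The stump misclassifies 5 of the 14 examples, which matches the 35.71% above.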


A training error of 35.71% (5 of 14 examples misclassified) is quite high for such a small dataset, so next we construct a full tree and recompute the training error. This is done by omitting the max_depth hyperparameter.

In [12]:
model = dt.DecisionTree(split_metric='gini')
model.fit(X, y, Attrs)

preds = model.predict(X, Attrs)

train_error = 1 - np.mean(preds == y)

print(f"Full Tree Training Error: {round(100 * (train_error), 2)}%")

Full Tree Training Error: 0.0%
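A training error of zero is only possible when no two examples share the same feature values but carry different labels. Assuming the full tree keeps splitting until every leaf is pure or the attributes run out (an assumption about decisiontree.py, not something shown here), the following plain-Python sketch checks that condition on this dataset.

```python
# Feature rows (Outlook, Temperature, Humidity, Wind) and labels from the dataset.
rows = [
    "SHHW", "SHHS", "OHHW", "RMHW", "RCNW", "RCNS", "OCNS",
    "SMHW", "SCNW", "RMNW", "SMNS", "OMHS", "OHNW", "RMHS",
]
labels = [0, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0]

# Collect the set of labels observed for each distinct feature row.
labels_by_row = {}
for row, y in zip(rows, labels):
    labels_by_row.setdefault(row, set()).add(y)

# A conflict is a feature row that appears with more than one label.
conflicts = {row: ys for row, ys in labels_by_row.items() if len(ys) > 1}
print("distinct feature rows:", len(labels_by_row))
print("conflicting rows:", conflicts)
```

All 14 rows are distinct and none conflict, so a tree grown deep enough can separate every training example. This says nothing about how well the tree would do on unseen data.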


This concludes the tutorial for decisiontree.py. You may notice that the other DecisionTree arguments ("rand_tree", "n_rand_attrs") were omitted from this introduction. They are of little practical use for single trees. They exist to support random forests and are used in the ensemble methods section.
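As a preview of why such options matter for random forests, the usual idea is that each split considers only a random subset of the attributes. The sketch below illustrates that idea in plain Python. How decisiontree.py actually interprets rand_tree and n_rand_attrs is not shown in this tutorial, so the candidate_attrs helper and its behavior are hypothetical.

```python
import random

def candidate_attrs(attr_names, n_rand_attrs, rng):
    """Hypothetical helper: pick the attributes one split is allowed to consider."""
    k = min(n_rand_attrs, len(attr_names))  # never ask for more than exist
    return rng.sample(attr_names, k)        # sampling without replacement

attr_names = ["Outlook", "Temperature", "Humidity", "Wind"]
rng = random.Random(0)  # seeded only so the sketch is repeatable

# Each split would draw a fresh subset, so different trees see different options.
subsets = [candidate_attrs(attr_names, 2, rng) for _ in range(3)]
for subset in subsets:
    print(subset)
```

Restricting each split to a random subset makes the trees in a forest less alike, which is what lets averaging over them help.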