# Tutorial: ensemble.py

First, import the required packages.

In [1]:
import numpy as np
from ensemble import AdaboostTrees
from ensemble import BaggedTrees
from ensemble import RandomForest

The dataset records whether someone will play tennis based on four weather factors: outlook, temperature, humidity, and wind. All four are categorical variables, and the label is binary. Below, we create the dataset along with a dictionary of the possible values each feature can take; both are required to fit a decision tree. Although this is the same dataset used in the Decision Tree tutorial, the binary label is now {-1, 1} instead of {0, 1} because of the implementation requirements of the ensemble methods.

In [2]:
data = np.array([['S', 'H', 'H', 'W', -1],
                ['S', 'H', 'H', 'S', -1],
                ['O', 'H', 'H', 'W', 1],
                ['R', 'M', 'H', 'W', 1],
                ['R', 'C', 'N', 'W', 1],
                ['R', 'C', 'N', 'S', -1],
                ['O', 'C', 'N', 'S', 1],
                ['S', 'M', 'H', 'W', -1],
                ['S', 'C', 'N', 'W', 1],
                ['R', 'M', 'N', 'W', 1],
                ['S', 'M', 'N', 'S', 1],
                ['O', 'M', 'H', 'S', 1],
                ['O', 'H', 'N', 'W', 1],
                ['R', 'M', 'H', 'S', -1]])

Attrs = {
    "Outlook"       : ["S", "O", "R"],
    "Temperature"   : ["H", "M", "C"],
    "Humidity"      : ["H", "N", "L"],
    "Wind"          : ["S", "W"]
}

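# np.array stores the mixed rows as strings, so split off the label column
# and cast it back to integers.
X = data[:,:-1]               # features: every column except the last
y = data[:,-1].astype(int)    # labels: last column, as -1 / 1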
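If you are starting from labels encoded as {0, 1}, the switch to {-1, 1} is a one-line affine map. The sketch below assumes the earlier encoding was 0 for "no" and 1 for "yes"; the name `y01` is made up for illustration and does not appear in ensemble.py.

```python
import numpy as np

# Hypothetical {0, 1} labels (assumed: 0 = does not play, 1 = plays).
y01 = np.array([0, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0])

# Map 0 -> -1 and 1 -> +1.
y_pm = 2 * y01 - 1

print(y_pm)
```

With this assumed input, `y_pm` matches the label column of `data` above.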

Now that we have the inputs required to construct a decision tree, we can create our ensembles. The algorithms implemented are AdaBoost Trees, Bagged Trees, and Random Forests. In practice, the models differ in how the base classifier is defined and in how the original dataset is perturbed for each ensemble member. Tuning the hyperparameters is therefore mostly a matter of configuring the decision tree used as each individual model in the ensemble.
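To make "perturbation of the original dataset" concrete, here is a minimal sketch of the bootstrap resampling that bagging-style methods typically use. It is a generic illustration and not necessarily how ensemble.py draws its samples; the seed and variable names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen only for reproducibility

n_samples = 14  # number of rows in the tennis dataset

# Bootstrap: draw n_samples row indices uniformly, with replacement.
idx = rng.integers(0, n_samples, size=n_samples)

# Rows that were never drawn are "out-of-bag" for this ensemble member.
oob = np.setdiff1d(np.arange(n_samples), idx)

print("bootstrap indices:", idx)
print("out-of-bag rows:  ", oob)
```

Each ensemble member would then be fit on `X[idx]` and `y[idx]`. Boosting perturbs the data differently, by reweighting the samples rather than resampling them.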

First, we define an AdaBoost model with the default settings, which give an ensemble of 10 boosted decision stumps. We then compute predictions on the training data and compare them to the observed labels.

In [3]:
model = AdaboostTrees()

model.fit(X, y, Attrs)

preds = model.predict(X, Attrs)

train_error = 1 - np.mean(preds == y)

print(f"Adaboost Decision Stumps Train Error: {round(100 * train_error, 2)}%")

Adaboost Decision Stumps Train Error: 28.57%
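For intuition about what boosting does between stumps, the following sketch runs one round of the textbook AdaBoost weight update on toy values. It is not taken from ensemble.py, whose internals may differ; the arrays `y`, `h`, and `w` are illustrative. The product `y * h` is +1 for a correct prediction and -1 for a mistake, which is one reason the {-1, 1} label encoding is convenient.

```python
import numpy as np

y = np.array([1, 1, -1, -1])     # true labels (toy values)
h = np.array([1, -1, -1, -1])    # one weak learner's predictions (toy values)
w = np.full(len(y), 1 / len(y))  # uniform starting weights

# Weighted error of the weak learner: total weight on misclassified samples.
eps = np.sum(w[h != y])

# Vote weight of this learner; larger when the weighted error is smaller.
alpha = 0.5 * np.log((1 - eps) / eps)

# Increase the weight of mistakes, decrease the weight of correct samples,
# then renormalize so the weights sum to 1.
w_new = w * np.exp(-alpha * y * h)
w_new /= w_new.sum()

print("error:", eps, "alpha:", alpha)
print("updated weights:", w_new)
```

After the update, the single misclassified sample carries half of the total weight, so the next weak learner is pushed to get it right.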


Next, we define a bagged trees model. The model is defined, fit, and used for prediction exactly as before. This time, we use an ensemble size of 20.

In [4]:
model = BaggedTrees(n_classifiers=20)

model.fit(X, y, Attrs)

preds = model.predict(X, Attrs)

train_error = 1 - np.mean(preds == y)

print(f"Bagged Decision Trees Train Error: {round(100 * train_error, 2)}%")

Bagged Decision Trees Train Error: 14.29%
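A bagged ensemble usually combines its members by majority vote. With {-1, 1} labels, the vote reduces to the sign of the summed predictions. The sketch below uses toy predictions and an arbitrary tie-breaking rule; it assumes nothing about how `BaggedTrees.predict` is actually implemented.

```python
import numpy as np

# Rows are ensemble members, columns are samples (toy values).
member_preds = np.array([[ 1, -1,  1],
                         [ 1,  1, -1],
                         [-1, -1, -1]])

# Sum the votes for each sample; a positive sum means +1 has the majority.
votes = member_preds.sum(axis=0)

# Ties (a sum of 0) are broken toward +1 here, which is an arbitrary choice.
ensemble_pred = np.where(votes >= 0, 1, -1)

print(ensemble_pred)
```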


Finally, we define a random forest model. Again, defining and using the model is the same as the other ensemble methods.

In [5]:
model = RandomForest()

model.fit(X, y, Attrs)

preds = model.predict(X, Attrs)

train_error = 1 - np.mean(preds == y)

print(f"Random Forest Train Error: {round(100 * train_error, 2)}%")

Random Forest Train Error: 14.29%
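What usually distinguishes a random forest from plain bagging is that each split considers only a random subset of the features. The sketch below illustrates that idea with a common square-root heuristic for the subset size. Both the heuristic and the variable names are assumptions, and the `RandomForest` class in ensemble.py may choose its subsets differently.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen only for reproducibility

features = ["Outlook", "Temperature", "Humidity", "Wind"]

# Assumed heuristic: consider about sqrt(number of features) per split.
k = int(np.sqrt(len(features)))

# Sample k distinct feature indices as the candidates for one split.
candidate = rng.choice(len(features), size=k, replace=False)
names = [features[i] for i in candidate]

print("candidate features for this split:", names)
```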


This concludes the tutorial for ensemble.py.