# Test - Mondrian Forest Classification

This notebook contains a simple test of the classification functions of the Mondrian Forest (MF) code written by B. Lakshminarayanan. Specifically, it runs the MF code on the USPS dataset discussed in [the original paper](http://www.gatsby.ucl.ac.uk/~balaji/mondrian_forests_nips14.pdf). For comparison, we also run a scikit-learn Random Forest and an Extra Trees classifier.

In short, I find that the MF code does produce classification performance similar to the other random forests, but it is strikingly slower than its batch counterparts. Online RF models may well be slow in general, but this is much slower than desired for data of this size and dimension. I am not sure whether refactoring the code would be a game changer, though I suspect it would.

In [3]:
# import some libraries
from src.mondrianforest_utils import load_data, reset_random_seed, precompute_minimal
from src.mondrianforest import process_command_line, MondrianForest

import pydot
import numpy as np
import pprint as pp     # pretty printing module

## Punking the command line

I do not personally enjoy the command line interface of the MF code.  Below is a class containing the settings necessary to run the code.

In [4]:
class ParamSettings(object):
    def __init__(self):
        self.dataset = 'usps'
        self.normalize_features = 1
        self.select_features = 0
        self.optype = 'class'
        self.data_path = './process_data/'
        self.debug = 0
        self.op_dir = 'results'
        self.tag = ''
        self.save = 0
        self.verbose = 1
        self.init_id = 1
        self.n_mondrians = 20
        self.budget = -1 # -1 sets lifetime to inf
        self.discount_factor = 10 # for NSP prior
        self.n_minibatches = 2
        self.draw_mondrian = 0
        self.smooth_hierarchically = 1
        self.store_every = 0
        self.bagging = 0
        self.min_samples_split = 2
        self.name_metric = 'acc'
        
        if self.optype == 'class':
            self.alpha = 0    # normalized stable prior
            assert self.smooth_hierarchically
            
        if self.budget < 0:
            self.budget_to_use = np.inf
        else:
            self.budget_to_use = self.budget  # use self: `settings` does not exist yet inside __init__
        
settings = ParamSettings()
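
The class above stands in for the parsed command-line options. As a lighter-weight alternative, the same idea can be sketched with `argparse.Namespace` from the standard library. The helper name `make_settings` and the `demo_settings` variable are hypothetical, only a handful of the fields are shown, and this sketch has not been run against the MF code, so treat it as an illustration of the budget rule rather than a drop-in replacement.

```python
from argparse import Namespace


def make_settings(**overrides):
    """Hypothetical helper: build a settings object from a few defaults."""
    values = dict(dataset='usps', optype='class', n_mondrians=20,
                  budget=-1, n_minibatches=2, smooth_hierarchically=1)
    values.update(overrides)
    settings = Namespace(**values)
    # Same rule as in ParamSettings: a negative budget means an
    # unbounded Mondrian lifetime.
    if settings.budget < 0:
        settings.budget_to_use = float('inf')
    else:
        settings.budget_to_use = settings.budget
    return settings


demo_settings = make_settings(n_mondrians=5)
print(sorted(vars(demo_settings).items()))
```

Keyword overrides make it easy to vary one setting per experiment without editing the class body.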

In [5]:
%matplotlib inline
PLOT = False

print 'Current settings:'
pp.pprint(vars(settings))

# Resetting random seed
reset_random_seed(settings)

# Loading data
data = load_data(settings)

Current settings:
{'alpha': 0,
 'bagging': 0,
 'budget': -1,
 'budget_to_use': inf,
 'data_path': './process_data/',
 'dataset': 'usps',
 'debug': 0,
 'discount_factor': 10,
 'draw_mondrian': 0,
 'init_id': 1,
 'min_samples_split': 2,
 'n_minibatches': 2,
 'n_mondrians': 20,
 'name_metric': 'acc',
 'normalize_features': 1,
 'op_dir': 'results',
 'optype': 'class',
 'save': 0,
 'select_features': 0,
 'smooth_hierarchically': 1,
 'store_every': 0,
 'tag': '',
 'verbose': 1}


# Run RF and ERT models

Use the scikit-learn versions of Random Forest and Extremely Randomized Trees to get comparable numbers for time and accuracy.

In [8]:
from time import time
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier

np.random.seed(1234)
def run_classifier(clf, data):
    """
    Run a sklearn classifier, return time and score
    """
    t0 = time()
    clf.fit(data['x_train'],data['y_train'])
    score = clf.score(data['x_test'], data['y_test'])
    run_time = time()-t0
    return run_time, score
    
rf_time, rf_score = run_classifier(RandomForestClassifier(n_estimators=100), data)
et_time, et_score = run_classifier(ExtraTreesClassifier(n_estimators=100), data)
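
Note that `run_classifier` reports a single time that covers both fitting and scoring. If the two phases need to be compared separately, a variant along the following lines could be used. This is a minimal sketch: `run_classifier_split` and the toy `MajorityClassifier` are hypothetical names introduced here so the example runs without the USPS data; any object with `fit` and `score` methods would work in its place.

```python
from time import time


class MajorityClassifier(object):
    """Hypothetical stand-in exposing the fit/score interface used above."""

    def fit(self, x, y):
        counts = {}
        for label in y:
            counts[label] = counts.get(label, 0) + 1
        # Remember the most frequent training label.
        self.label_ = max(counts, key=counts.get)
        return self

    def score(self, x, y):
        hits = sum(1 for label in y if label == self.label_)
        return hits / float(len(y))


def run_classifier_split(clf, data):
    """Return (fit_time, score_time, score), timing each phase separately."""
    t0 = time()
    clf.fit(data['x_train'], data['y_train'])
    t1 = time()
    score = clf.score(data['x_test'], data['y_test'])
    t2 = time()
    return t1 - t0, t2 - t1, score


toy = {'x_train': [[0], [1], [2]], 'y_train': [1, 1, 0],
       'x_test': [[3], [4]], 'y_test': [1, 0]}
fit_time, score_time, score = run_classifier_split(MajorityClassifier(), toy)
print('fit %.4f s, score %.4f s, accuracy %.2f' % (fit_time, score_time, score))
```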

# Run the MF

This code is mostly copied from mondrianforest_demo.py in this repo, with some small modifications for this notebook.

In [9]:
# data loading and cache
data = load_data(settings)
param, cache = precompute_minimal(data, settings)

mf = MondrianForest(settings, data)

# begin training and prediction, timing the overall process
print '\nminibatch\tmetric_train\tmetric_test\tnum_leaves'
t0=time()

# loop over minibatches
for idx_minibatch in range(settings.n_minibatches):
    train_ids_current_minibatch = data['train_ids_partition']['current'][idx_minibatch]
    if idx_minibatch == 0:
        # Batch training for first minibatch
        mf.fit(data, train_ids_current_minibatch, settings, param, cache)
    else:
        # Online update
        mf.partial_fit(data, train_ids_current_minibatch, settings, param, cache)

    # Evaluate, giving every tree in the forest the same weight
    weights_prediction = np.ones(settings.n_mondrians) * 1.0 / settings.n_mondrians
    train_ids_cumulative = data['train_ids_partition']['cumulative'][idx_minibatch]
    
    # training predictions
    pred_forest_train, metrics_train = mf.evaluate_predictions(
        data, data['x_train'][train_ids_cumulative, :],
        data['y_train'][train_ids_cumulative],
        settings, param, weights_prediction, False)

    # test predictions
    pred_forest_test, metrics_test = mf.evaluate_predictions(
        data, data['x_test'], data['y_test'],
        settings, param, weights_prediction, False)
    name_metric = settings.name_metric     # acc or mse
    metric_train = metrics_train[name_metric]
    metric_test = metrics_test[name_metric]
    tree_numleaves = np.zeros(settings.n_mondrians)
    for i_t, tree in enumerate(mf.forest):
        tree_numleaves[i_t] = len(tree.leaf_nodes)
    forest_numleaves = np.mean(tree_numleaves)
    print '%9d\t%.3f\t\t%.3f\t\t%.3f' % (idx_minibatch, metric_train, metric_test, forest_numleaves)
mf_time = time() - t0

print '\nFinal forest stats:'
tree_stats = np.zeros((settings.n_mondrians, 2))
tree_average_depth = np.zeros(settings.n_mondrians)
for i_t, tree in enumerate(mf.forest):
    tree_stats[i_t, -2:] = np.array([len(tree.leaf_nodes), len(tree.non_leaf_nodes)])
    tree_average_depth[i_t] = tree.get_average_depth(settings, data)[0]
print 'mean(num_leaves) = %.1f, mean(num_non_leaves) = %.1f, mean(tree_average_depth) = %.1f' \
        % (np.mean(tree_stats[:, -2]), np.mean(tree_stats[:, -1]), np.mean(tree_average_depth))
print 'n_train = %d, log_2(n_train) = %.1f, mean(tree_average_depth) = %.1f +- %.1f' \
        % (data['n_train'], np.log2(data['n_train']), np.mean(tree_average_depth), np.std(tree_average_depth))


minibatch	metric_train	metric_test	num_leaves
        0	1.000		0.904		1830.350
        1	1.000		0.922		3367.050

Final forest stats:
mean(num_leaves) = 3367.1, mean(num_non_leaves) = 3366.1, mean(tree_average_depth) = 19.7
n_train = 7291, log_2(n_train) = 12.8, mean(tree_average_depth) = 19.7 +- 1.3
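
The loop above relies on two index lists per minibatch: `'current'` (the ids added in that step) and `'cumulative'` (all ids seen so far, used for the training metric). The sketch below shows one way such a partition could be built. It is an assumption for illustration only: the function name `partition_train_ids` is hypothetical, and the contiguous, unshuffled split is not necessarily how `load_data` constructs `train_ids_partition`.

```python
def partition_train_ids(n_train, n_minibatches):
    """Split ids 0..n_train-1 into contiguous minibatches (assumed layout)."""
    # Minibatch k covers ids bounds[k] .. bounds[k + 1] - 1.
    bounds = [n_train * k // n_minibatches for k in range(n_minibatches + 1)]
    current = [list(range(bounds[k], bounds[k + 1]))
               for k in range(n_minibatches)]
    # The cumulative list for step k holds every id seen up to and including k.
    cumulative = [list(range(0, bounds[k + 1]))
                  for k in range(n_minibatches)]
    return {'current': current, 'cumulative': cumulative}


parts = partition_train_ids(10, 2)
print(parts['current'])
print(parts['cumulative'])
```

With two minibatches, the first step trains in batch mode on `current[0]`, and the second step updates online with `current[1]` while the training metric is computed over `cumulative[1]`, that is, the full training set.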


# Summary

The accuracy on the UCI USPS dataset is similar across the classifiers (0.92 to 0.95 in the run below), and close to the published values. However, the MF code takes much longer to run: about 76 s, compared with about 9 s for RF and 3 s for ET.

In [10]:
print 'RF time, accuracy = %0.2f, %0.2f' % (rf_time, rf_score)
print 'ET time, accuracy = %0.2f, %0.2f' % (et_time, et_score)
print 'MF time, accuracy = %0.2f, %0.2f' % (mf_time, metric_test)

RF time, accuracy = 9.31, 0.94
ET time, accuracy = 2.77, 0.95
MF time, accuracy = 76.30, 0.92
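
To put the gap in proportion, the timings printed above can be turned into slowdown factors. The sketch below hard-codes the numbers from this run together with the tree counts used (100 for RF and ET, 20 for MF); the `runs` dictionary and the `slowdown` helper are hypothetical names. The per-tree figure is only a rough normalization, since the MF timing also includes evaluating predictions on the training set in every minibatch.

```python
# Timings in seconds and tree counts, copied from the cells above.
runs = {'RF': (9.31, 100), 'ET': (2.77, 100), 'MF': (76.30, 20)}


def slowdown(name, reference='MF'):
    """Return (total, per_tree) slowdown of `reference` relative to `name`."""
    ref_time, ref_trees = runs[reference]
    run_time, n_trees = runs[name]
    total = ref_time / run_time
    per_tree = (ref_time / ref_trees) / (run_time / n_trees)
    return total, per_tree


for name in ('RF', 'ET'):
    total, per_tree = slowdown(name)
    print('MF vs %s: %.1fx slower overall, %.1fx slower per tree'
          % (name, total, per_tree))
```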
