# SPORF Tutorial

The purpose of this tutorial is to verify that this pure Python implementation of SPORF matches, in terms of functionality, the one used in the SPORF paper (Tomita, Tyler M., et al. "Sparse projection oblique randomer forests." Journal of Machine Learning Research 21.104 (2020): 1-39.). To do this, the notebook runs this implementation of SPORF on three data sets: hill valley without noise, acute inflammation task 1, and acute inflammation task 2. The implementations are compared using Cohen's kappa, defined here as the fractional decrease in error rate relative to the chance error rate. If this implementation produces the same kappa values as the paper on the same data sets, that is good evidence that it is accurate. All three data sets had kappa values of 100 ± 0 in the SPORF paper, and the runs below produce the same values with this implementation. On that basis, we can say with confidence that this implementation of SPORF is accurate.
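
To make the metric concrete, the kappa described above can be written out as follows. This is a sketch of the definition rather than a quote from the paper: the symbols $a$ (test accuracy), $\varepsilon$ (error rate), and $K$ (number of classes) are introduced here for illustration, and the chance accuracy of $1/K$ is an assumption taken from the final cell of this notebook, which sets `chance_pred = 1 / n_classes`.

```latex
\kappa
  = \frac{\varepsilon_{\text{chance}} - \varepsilon}{\varepsilon_{\text{chance}}}
  = \frac{a - p_{\text{chance}}}{1 - p_{\text{chance}}},
\qquad
\varepsilon = 1 - a,
\quad
\varepsilon_{\text{chance}} = 1 - p_{\text{chance}},
\quad
p_{\text{chance}} = \frac{1}{K}
```

Under this definition, a classifier at chance level has $\kappa = 0$ and a perfect classifier has $\kappa = 1$. The notebook reports kappa multiplied by 100.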

## Import required packages

In [1]:
import sys
import numpy as np
import pandas as pd

from proglearn.progressive_learner import ProgressiveLearner
from proglearn.voters import TreeClassificationVoter
from proglearn.transformers import TreeClassificationTransformer
from proglearn.transformers import ObliqueTreeClassificationTransformer
from proglearn.deciders import SimpleArgmaxAverage

from sklearn.model_selection import train_test_split, cross_val_score

from sporf_tutorial_functions import *

## SPORF

## Set parameters and run on hill valley without noise data

In [2]:
max_depth = 10
feature_combinations = 2
density = 0.01
reps = 5
n_trees = 10
task_num = 1

kwargs = {"kwargs" : {"max_depth" : max_depth, "feature_combinations" : feature_combinations, "density" : density}}

kappa, err = test("https://archive.ics.uci.edu/ml/machine-learning-databases/hill-valley/Hill_Valley_without_noise_Training.data", reps, n_trees, task_num,
                            ObliqueTreeClassificationTransformer,
                            kwargs)

print("kappa: ", kappa, ", error:", err)

Accuracy after iteration  0 :  1.0
Accuracy after iteration  1 :  1.0
Accuracy after iteration  2 :  1.0
Accuracy after iteration  3 :  1.0
Accuracy after iteration  4 :  1.0
kappa:  100.0 , error: 0.0


## Set parameters and run on acute inflammation task 1 data

In [3]:
max_depth = 10
feature_combinations = 1.5
density = 0.5
reps = 5
n_trees = 10
task_num = 1

kwargs = {"kwargs" : {"max_depth" : max_depth, "feature_combinations" : feature_combinations, "density" : density}}

kappa, err = test("https://archive.ics.uci.edu/ml/machine-learning-databases/acute/diagnosis.data", reps, n_trees, task_num,
                            ObliqueTreeClassificationTransformer,
                            kwargs)

print("kappa: ", kappa, ", error:", err)

Accuracy after iteration  0 :  1.0
Accuracy after iteration  1 :  1.0
Accuracy after iteration  2 :  1.0
Accuracy after iteration  3 :  1.0
Accuracy after iteration  4 :  1.0
kappa:  100.0 , error: 0.0


## Set parameters and run on acute inflammation task 2 data

In [4]:
max_depth = 10
feature_combinations = 1.5
density = 0.5
reps = 5
n_trees = 10
task_num = 2

kwargs = {"kwargs" : {"max_depth" : max_depth, "feature_combinations" : feature_combinations, "density" : density}}

kappa, err = test("https://archive.ics.uci.edu/ml/machine-learning-databases/acute/diagnosis.data", reps, n_trees, task_num,
                            ObliqueTreeClassificationTransformer,
                            kwargs)

print("kappa: ", kappa, ", error:", err)

Accuracy after iteration  0 :  1.0
Accuracy after iteration  1 :  1.0
Accuracy after iteration  2 :  1.0
Accuracy after iteration  3 :  1.0
Accuracy after iteration  4 :  1.0
kappa:  100.0 , error: 0.0


## Random Forest (RF)

Next, we run the same data sets through a baseline random forest (RF). The goal is to show that SPORF performs at least as well as RF and, on some data, clearly outperforms it. As the results below show, SPORF achieves a much higher kappa than RF on the hill valley without noise data (100 versus about 17.4) and the same kappa on both acute inflammation tasks. A high kappa is desirable because, as noted above, it measures how far the error rate falls below the chance error rate.

## Set parameters and run on hill valley without noise data

In [5]:
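# Only max_depth is passed to the RF baseline; feature_combinations and
# density were set in the original cell but never used, so they are omitted.
max_depth = 10
reps = 5
n_trees = 10
task_num = 1

kwargs = {"kwargs" : {"max_depth" : max_depth}}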

kappa, err = test("https://archive.ics.uci.edu/ml/machine-learning-databases/hill-valley/Hill_Valley_without_noise_Training.data", reps, n_trees, task_num,
                            TreeClassificationTransformer,
                            kwargs)

print("kappa: ", kappa, ", error:", err)

Accuracy after iteration  0 :  0.5409836065573771
Accuracy after iteration  1 :  0.5901639344262295
Accuracy after iteration  2 :  0.5901639344262295
Accuracy after iteration  3 :  0.6885245901639344
Accuracy after iteration  4 :  0.5245901639344263
kappa:  17.37704918032787 , error: 5.1130724431784715
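
As a sanity check, the kappa and error reported for RF above can be recomputed from the five printed accuracies. The sketch below assumes that `test` aggregates results the same way as the final cell of this notebook (chance accuracy of `1 / n_classes`, mean kappa times 100, and standard error from `np.std`), and that hill valley is a two-class problem. The helper name `summarize_kappa` is hypothetical and is not part of `sporf_tutorial_functions`.

```python
import numpy as np

# Per-repetition accuracies copied from the RF hill valley output above.
accuracies = np.array([
    0.5409836065573771,
    0.5901639344262295,
    0.5901639344262295,
    0.6885245901639344,
    0.5245901639344263,
])


def summarize_kappa(accuracies, n_classes):
    """Hypothetical helper: mean kappa and its standard error, in percent."""
    accuracies = np.asarray(accuracies, dtype=float)
    chance = 1.0 / n_classes  # assumed chance accuracy, as in the final cell
    kappa = (accuracies - chance) / (1.0 - chance)
    mean_kappa = np.mean(kappa) * 100
    std_err = (np.std(kappa) * 100) / np.sqrt(len(kappa))
    return mean_kappa, std_err


# Hill valley has two classes (hill or valley).
mean_kappa, std_err = summarize_kappa(accuracies, n_classes=2)
print("kappa: ", mean_kappa, ", error:", std_err)
```

Under these assumptions the recomputed values agree with the reported kappa of about 17.38 and error of about 5.11, which suggests that the error figure is the standard error of kappa across repetitions rather than a misclassification rate.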


## Set parameters and run on acute inflammation task 1 data

In [6]:
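# Only max_depth is passed to the RF baseline; feature_combinations and
# density were set in the original cell but never used, so they are omitted.
max_depth = 10
reps = 5
n_trees = 10
task_num = 1

kwargs = {"kwargs" : {"max_depth" : max_depth}}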

kappa, err = test("https://archive.ics.uci.edu/ml/machine-learning-databases/acute/diagnosis.data", reps, n_trees, task_num,
                            TreeClassificationTransformer,
                            kwargs)

print("kappa: ", kappa, ", error:", err)

Accuracy after iteration  0 :  1.0
Accuracy after iteration  1 :  1.0
Accuracy after iteration  2 :  1.0
Accuracy after iteration  3 :  1.0
Accuracy after iteration  4 :  1.0
kappa:  100.0 , error: 0.0


## Set parameters and run on acute inflammation task 2 data

In [7]:
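# Only max_depth is passed to the RF baseline; feature_combinations and
# density were set in the original cell but never used, so they are omitted.
max_depth = 10
reps = 5
n_trees = 10
task_num = 2

kwargs = {"kwargs" : {"max_depth" : max_depth}}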

kappa, err = test("https://archive.ics.uci.edu/ml/machine-learning-databases/acute/diagnosis.data", reps, n_trees, task_num,
                            TreeClassificationTransformer,
                            kwargs)

print("kappa: ", kappa, ", error:", err)

Accuracy after iteration  0 :  1.0
Accuracy after iteration  1 :  1.0
Accuracy after iteration  2 :  1.0
Accuracy after iteration  3 :  1.0
Accuracy after iteration  4 :  1.0
kappa:  100.0 , error: 0.0


## Conclusions

From the results obtained in this notebook, we conclude that this implementation of SPORF is accurate: it reproduces the kappa values reported in the SPORF paper on all three data sets. The results also show how useful SPORF can be, especially in a model that uses ensembling. It does much better than RF on certain data sets while maintaining the high kappa values on the data sets where RF already performs well.

In [15]:
max_depth = 10
feature_combinations = 1.5
density = 0.5
reps = 1
n_trees = 10
task_num = 2
sample_size = 400

X_train, y_train = load_simulated_data('Orthant_train.csv')
X_test, y_test = load_simulated_data('Orthant_test.csv')
X = np.concatenate((X_train, X_test), axis=0)
y = np.concatenate((y_train, y_test))
n_classes = len(np.unique(y))

kappa = np.zeros(reps)
for i in range(reps):
    # Draw a fresh stratified split of the pooled data on every repetition.
    X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=sample_size, shuffle=True, stratify=y)

    # Evaluate on the first 400 held-out samples only.
    X_test = X_test[:400,:]
    y_test = y_test[:400]
            
    kwargs = {"kwargs" : {"max_depth" : max_depth, "feature_combinations" : feature_combinations, "density" : density}}

    default_decider_kwargs = {"classes": np.arange(n_classes)}

    pl = ProgressiveLearner(
        default_transformer_class=ObliqueTreeClassificationTransformer,
        default_transformer_kwargs=kwargs,
        default_voter_class=TreeClassificationVoter,
        default_voter_kwargs={},
        default_decider_class=SimpleArgmaxAverage,
        default_decider_kwargs=default_decider_kwargs)

    pl.add_task(X_train, y_train, num_transformers=n_trees)

    y_hat = pl.predict(X_test, task_id=0)

    acc = np.sum(y_test == y_hat) / len(y_test)
    print("Accuracy after iteration ", i, ": ", acc)

    chance_pred = 1 / n_classes
    kappa[i] = (acc - chance_pred) / (1 - chance_pred)

kap = np.mean(kappa) * 100
err = (np.std(kappa) * 100) / np.sqrt(reps)

print("kappa: ", kap, ", error:", err)

ValueError: could not broadcast input array from shape (400,58) into shape (400)