# Ensemble Testing

## Overview

We have created a Python module, `homogeneous_ensemble.py`, that makes the homogeneous ensemble callable.

In [1]:
import homogeneous_ensemble as he

The following cell imports the required modules from scikit-learn and the Python standard library, loads the abalone data, and one-hot encodes the `Sex` column.

In [35]:
#  Numerical and data-frame libraries
import numpy as np
import pandas as pd

#  SciKit-Learn algorithms
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import mean_squared_error
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import OneHotEncoder
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

#  Standard Python
import time  #  Used for timing algorithms
import random

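#  Data Import and Processing
abalone = pd.read_csv("abalone.csv")

#  One-hot encode the categorical "Sex" column (categories F, I, M)
oe_style = OneHotEncoder()
oe_results = oe_style.fit_transform(abalone[["Sex"]])

#  Join the encoded columns in front of the original data, then drop "Sex"
abalone = pd.DataFrame(oe_results.toarray(), columns=oe_style.categories_).join(abalone)
abalone = abalone.drop("Sex", axis=1)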
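
The encoded columns in the outputs below carry tuple-like names such as `(F,)`, which appears to come from passing `categories_` (a list of arrays) as the column labels. As a minimal sketch of an alternative, assuming a toy frame with made-up values in place of `abalone.csv`, `pd.get_dummies` produces flat names such as `Sex_F`:

```python
import pandas as pd

# Toy stand-in for the abalone data (values are made up for illustration)
toy = pd.DataFrame({
    "Sex": ["M", "F", "I", "M"],
    "Length": [0.455, 0.53, 0.33, 0.44],
})

# get_dummies replaces "Sex" with one indicator column per category
encoded = pd.get_dummies(toy, columns=["Sex"], dtype=float)
print(encoded.columns.tolist())
```

Note that this places the indicator columns after the remaining columns, so code that relies on the target being the last column would need to reorder them.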

Below, we split our data into training and validation partitions.

In [3]:
training, valid = he.split_train_test(abalone, 0.2)
training.head()

| | (F,) | (I,) | (M,) | Length | Diameter | Height | Whole_weight | Shucked_weight | Viscera_weight | Shell_weight | Rings |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 3526 | 0.0 | 1.0 | 0.0 | 0.615 | 0.46 | 0.19 | 1.066 | 0.4335 | 0.226 | 0.33 | 13 |
| 2936 | 1.0 | 0.0 | 0.0 | 0.575 | 0.48 | 0.15 | 0.8745 | 0.375 | 0.193 | 0.29 | 12 |
| 3002 | 0.0 | 1.0 | 0.0 | 0.575 | 0.475 | 0.17 | 0.967 | 0.3775 | 0.284 | 0.275 | 13 |
| 2975 | 1.0 | 0.0 | 0.0 | 0.62 | 0.5 | 0.175 | 1.146 | 0.477 | 0.23 | 0.39 | 13 |
| 529 | 1.0 | 0.0 | 0.0 | 0.41 | 0.305 | 0.1 | 0.363 | 0.1735 | 0.065 | 0.11 | 11 |
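
The internals of `he.split_train_test` are not shown in this notebook. As a rough sketch of what a shuffled split could look like, assuming the second argument is the fraction of rows held out, the hypothetical helper `split_rows` below shuffles row labels and cuts off that fraction:

```python
import random

def split_rows(rows, test_ratio, seed=0):
    """Shuffle rows and split off a test_ratio fraction (hypothetical helper)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # Seeded so the split is reproducible
    n_test = int(len(shuffled) * test_ratio)
    # The first n_test shuffled rows are held out; the rest are used for training
    return shuffled[n_test:], shuffled[:n_test]

train_rows, test_rows = split_rows(range(10), 0.2)
print(len(train_rows), len(test_rows))
```

The shuffled row labels in the table above are consistent with a split of this kind, but the actual implementation may differ.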


We now test the imported algorithm. We restrict the ensemble to support vector regression as its base algorithm and train two predictors.

In [23]:
weights, predictors = he.homog_ens(training, 1, 2)
print("The weights for predictor 0 and predictor 1, resp., are:", weights)
print("The intercept for predictor 0 is: ", predictors[0].intercept_)
print("The intercept for predictor 1 is: ", predictors[1].intercept_)

The weights for predictor 0 and predictor 1, resp., are: [0.22903060914341677, 0.2000998925150517]
The intercept for predictor 0 is:  [9.98579125]
The intercept for predictor 1 is:  [9.75556279]


Next, we separate the validation features from the target (`Rings`, the last column) and look at the first few rows of the features.

In [6]:
X_valid = valid.iloc[:,0:-1]
Y_valid = valid.iloc[:,-1]

X_valid.iloc[:5,:]

| | (F,) | (I,) | (M,) | Length | Diameter | Height | Whole_weight | Shucked_weight | Viscera_weight | Shell_weight |
|---|---|---|---|---|---|---|---|---|---|---|
| 939 | 1.0 | 0.0 | 0.0 | 0.66 | 0.505 | 0.185 | 1.528 | 0.69 | 0.3025 | 0.441 |
| 1326 | 1.0 | 0.0 | 0.0 | 0.495 | 0.38 | 0.12 | 0.573 | 0.2655 | 0.1285 | 0.144 |
| 2428 | 0.0 | 0.0 | 1.0 | 0.67 | 0.51 | 0.18 | 1.68 | 0.926 | 0.2975 | 0.3935 |
| 1047 | 0.0 | 0.0 | 1.0 | 0.605 | 0.475 | 0.14 | 1.1175 | 0.555 | 0.257 | 0.274 |
| 1109 | 0.0 | 1.0 | 0.0 | 0.35 | 0.255 | 0.09 | 0.1785 | 0.0855 | 0.0305 | 0.0525 |


We will store the predictions of each predictor in a list.

In [24]:
predictions = []
for p in predictors:
    predictions.append(p.predict(X_valid))
    
predictions[0][0:5]

array([11.08968025,  9.28289035,  9.85818996,  9.35739023,  6.21958492])

The following cell shows how scikit-learn's `mean_squared_error` function is used. We demonstrate it on the two weak predictors.

In [27]:
#  mean_squared_error expects (y_true, y_pred); the MSE itself is symmetric in its arguments
print(mean_squared_error(Y_valid, predictions[0]), mean_squared_error(Y_valid, predictions[1]))

5.321578092199089 5.421979723989206
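
To make the metric concrete, here is a small hand-written version of the mean squared error, the average of the squared differences between targets and predictions. The function name `mse` and the sample values are illustrative only:

```python
def mse(y_true, y_pred):
    """Mean of the squared differences between targets and predictions."""
    if len(y_true) != len(y_pred):
        raise ValueError("inputs must have the same length")
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

# Made-up targets and predictions: squared errors are 1, 4, and 0
y_true = [10, 12, 9]
y_pred = [11.0, 10.0, 9.0]
print(mse(y_true, y_pred))
```

Because each difference is squared, swapping the two arguments does not change the result.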


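The following cell shows how the final ensemble prediction and its MSE are computed. The ensemble prediction is the weighted average of the individual predictions, using the weights returned with the predictors.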

In [29]:
#  Accumulate the weighted sum of predictions and the sum of weights
num = 0         #  Numerator of the weighted average
weight_sum = 0  #  Denominator of the weighted average

for w, p in zip(weights, predictions):
    num += w * p      #  Add this predictor's weighted predictions
    weight_sum += w   #  Add this predictor's weight

#  The ensemble prediction is the weighted average of the predictors
guess = num / weight_sum
mse = mean_squared_error(Y_valid, guess)  #  MSE for ensemble

print("Ensemble MSE:", mse)

Ensemble MSE: 5.3655418553187095
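
The same weighted average can be written in vectorized form with NumPy's `np.average`. This is only a sketch with made-up weights and predictions, not the values produced by `he.homog_ens`:

```python
import numpy as np

# Made-up weights and per-predictor predictions for two validation rows
weights = [0.2, 0.3]
predictions = [np.array([10.0, 8.0]), np.array([12.0, 9.0])]

# Weighted average across predictors (axis 0), one value per validation row
guess = np.average(predictions, axis=0, weights=weights)

# Equivalent explicit computation: sum(w * p) / sum(w)
manual = sum(w * p for w, p in zip(weights, predictions)) / sum(weights)
print(guess, manual)
```

`np.average` divides by the sum of the weights, so the weights do not need to add up to one.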


We create a table to hold the MSE and runtime of each algorithm.

In [30]:
table_1 = [["DecisionTree",0,0],["SVR",0,0],["kNN",0,0],["RandomForest",0,0],["Gradient",0,0]]
cols = ["Algorithm", "MSE", "Time"]
table_1 = pd.DataFrame(table_1, columns=cols)
print("MSE's and Times for ensembles of", 25, "predictors each.")
table_1

MSE's and Times for ensembles of 25 predictors each.


| | Algorithm | MSE | Time |
|---|---|---|---|
| 0 | DecisionTree | 0 | 0 |
| 1 | SVR | 0 | 0 |
| 2 | kNN | 0 | 0 |
| 3 | RandomForest | 0 | 0 |
| 4 | Gradient | 0 | 0 |


The following cell suppresses `FutureWarning` messages, which would otherwise fill pages of output.

In [33]:
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)  #  Ignore every warning of category FutureWarning
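
The filter above applies for the rest of the session. If that is broader than desired, the suppression can be limited to a single block with `warnings.catch_warnings()`. In this sketch, `noisy_call` is a hypothetical stand-in for a library call that emits a `FutureWarning`:

```python
import warnings

def noisy_call():
    """Stand-in for a library call that emits a FutureWarning (hypothetical)."""
    warnings.warn("this behaviour will change", FutureWarning)
    return 42

# Suppress FutureWarning only inside this block; filters are restored on exit
with warnings.catch_warnings():
    warnings.simplefilter(action="ignore", category=FutureWarning)
    result = noisy_call()

print(result)
```

Outside the `with` block, the previous warning filters are back in effect.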

## Comparing Ensemble Algorithms

We are finally ready to compare ensemble methods. A loop iterates through the homogeneous ensembles whose base algorithms are decision trees, support vector regressors, and k-nearest neighbors (kNN). After the loop, we train a random forest and a gradient-boosted ensemble for comparison. In the first line of code, `n` sets the number of (weak) predictors in each ensemble. When the code finishes, a table reports each ensemble's MSE and runtime.

In [34]:
# Change 'n' to desired number of predictors for each ensemble
n = 256

#  Here is the loop for iterating through the homogeneous ensembles
for i in range(0,3):

    #  The timer covers both training and prediction on the validation set
    t0 = time.time()
    weights, predictors = he.homog_ens(training, i, n)

    predictions = []
    for p in predictors:
        predictions.append(p.predict(X_valid))

    #  Weighted average of the individual predictions
    num = 0
    weight_sum = 0
    for w, pred in zip(weights, predictions):
        num += w * pred
        weight_sum += w

    guess = num / weight_sum

    t1 = time.time()

    table_1.loc[i,"MSE"] = mean_squared_error(Y_valid, guess)
    table_1.loc[i,"Time"] = t1-t0

#  Here is the creation of the random forest
#  The timer covers the split and the fit only, not prediction
t0 = time.time()
#  Split the training partition again; only train_set is used to fit the forest
train_set, test_set = he.split_train_test(training, 0.2)
X_vars = train_set.iloc[:,:-1]   #  Features
X_labels = train_set.iloc[:,-1]  #  Target (Rings), treated as class labels by the classifier
Y = X_labels.to_numpy()
X = X_vars.to_numpy()

rf = RandomForestClassifier(n_estimators=n, max_depth=5, max_features=None, bootstrap=False)
rf.fit(X, Y)
t1 = time.time()


table_1.loc[3,"MSE"] = mean_squared_error(Y_valid, rf.predict(X_valid))
table_1.loc[3,"Time"] = t1-t0

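#  Here is the creation of the gradient boosted ensemble
#  The timer covers the split and the fit only, not prediction
t0 = time.time()
#  Split the training partition again; only train_set is used to fit the ensemble
train_set, test_set = he.split_train_test(training, 0.2)
X_vars = train_set.iloc[:,:-1]   #  Features
X_labels = train_set.iloc[:,-1]  #  Target (Rings), treated as class labels by the classifier
Y = X_labels.to_numpy()
X = X_vars.to_numpy()

gb = GradientBoostingClassifier(n_estimators=n)
gb.fit(X, Y)
t1 = time.time()


table_1.loc[4,"MSE"] = mean_squared_error(Y_valid, gb.predict(X_valid))
table_1.loc[4,"Time"] = t1-t0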

print("\nMSE's and Times for ensembles of", n, "predictors each")
table_1


MSE's and Times for ensembles of 256 predictors each


| | Algorithm | MSE | Time |
|---|---|---|---|
| 0 | DecisionTree | 6.169857 | 2.215339 |
| 1 | SVR | 5.384239 | 212.035558 |
| 2 | kNN | 6.394593 | 14.100879 |
| 3 | RandomForest | 7.221823 | 1.801786 |
| 4 | Gradient | 6.833333 | 29.39364 |
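
The runtimes above are measured with paired `time.time()` calls. As a sketch of an alternative, the hypothetical helper `timed` below wraps any call and uses `time.perf_counter()`, a clock intended for measuring short intervals:

```python
import time

def timed(fn, *args, **kwargs):
    """Run fn(*args, **kwargs) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start

# Example: time a cheap built-in call
total, seconds = timed(sum, range(1000))
print(total, seconds)
```

Wrapping each ensemble's training and prediction in one such call would also make it easier to time the same steps for every algorithm.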


## Addressing a Peculiarity

You may have noticed that the MSE for the DecisionTree-based ensemble is noticeably smaller than for the RandomForest (6.17 versus 7.22). While random forests are ensembles of decision trees, they inject randomness into how each tree is trained. scikit-learn's `DecisionTreeClassifier` uses the "best" splitter by default. If the `splitter` parameter is changed to "random" in the `DecisionTreeClassifier` call, we get comparable results, with the RandomForest edging out the DecisionTree ensemble slightly.
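
For reference, the `splitter` setting can be changed as in the sketch below. It uses synthetic data from `make_classification` rather than the abalone set, and the depth and seed are arbitrary choices, so it shows only the API and does not reproduce the comparison described above:

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data standing in for the abalone features and labels
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# Default behaviour: choose the best split at each node
best_tree = DecisionTreeClassifier(splitter="best", max_depth=5, random_state=0)
best_tree.fit(X, y)

# Alternative: choose the best of randomly drawn candidate splits
random_tree = DecisionTreeClassifier(splitter="random", max_depth=5, random_state=0)
random_tree.fit(X, y)

print(best_tree.score(X, y), random_tree.score(X, y))
```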