# CS361 Assignment 1

In these experiments, we endeavour to find the best ensemble learning algorithm for the Arrhythmia, Caesarian and Website-Phishing datasets respectively, out of the Decision Tree, Random Forest, Bagging and AdaBoost methods.

For Task 2, I chose the GradientBoosting algorithm, and implemented it along with the other four prior algorithms mentioned on the following datasets.

In [70]:
#Loading Libraries

from pathlib import Path

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import sklearn.datasets

from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold, StratifiedKFold, cross_val_score
from autorank import autorank, plot_stats, create_report, latex_table
from statistics import stdev

from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, BaggingClassifier, AdaBoostClassifier, VotingClassifier, GradientBoostingClassifier


# Render matplotlib plots in the notebook
%matplotlib inline



We set a random state to ensure that out experiments are reproducible.

In [57]:
#Setting Random State
rs = 1234
np.random.seed(rs)

## Arrhythmia Analysis

We began by splitting each dataset into their feature set (X) and target set (y). For all datasets, the target was the final column ‘class’

In [58]:
#Load Arrhythmia Dataset
df = pd.read_csv('arrhythmia.csv', na_values = ['?'] ) #Converts '?' values to NaN
X = df.loc[:, 'age':'chV6_QRSTA']
y = df.loc[:, 'class']

X = np.nan_to_num(X) #Converts remaining NA values to '0'
print(X.shape, y.shape)

(452, 279) (452,)


From this, we distinguished the training and test set at an 80:20 split.  As the name suggests, training data teaches the model, whereas test data provides insight into how well the model generalises to unseen data. Splitting the data into 80% training and 20% testing means the model will assess more known cases and hopefully come to an accurate solution, when tested against the unseen data. This is only appropriate for large datasets that would have enough data to enable a training/test split and be able to learn general principles from the training set, which is luckily provided by the datasets required in this assignment.

In [59]:
#Distinguish Training and Test Set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state = rs)
print(X_train.shape, X_test.shape)

(361, 279) (91, 279)


### Ensemble Algorithms

Now we run the ensemble algorithms on the training set.

In [71]:
#Ensemble Algorithms

#Straight Decision Tree
dt = DecisionTreeClassifier(random_state = rs)
dt.fit(X_train, y_train)

#Random Forest
rf = RandomForestClassifier(n_estimators = 10, random_state = rs)
rf.fit(X_train, y_train)

#AdaBoost
adb = AdaBoostClassifier(DecisionTreeClassifier(random_state = rs), n_estimators = 10, learning_rate=1)
adb.fit(X_train, y_train)

#Bagging
bg = BaggingClassifier(DecisionTreeClassifier(random_state = rs), n_estimators = 10)
bg.fit(X_train, y_train)

#Gradient Boosting
gb = GradientBoostingClassifier(random_state = rs, n_estimators = 10)
gb.fit(X_train, y_train)

GradientBoostingClassifier(n_estimators=10, random_state=1234)

Next we derive the accuracy score of each algorithm for the training set.

In [72]:
#Accuracy Scores

#Standard Decision Tree Score
dt_score = dt.score(X_train, y_train)
print(f'Decision Tree Accuracy on train set: {dt_score:.2f}')

dt_score = dt.score(X_test, y_test)
print(f'Decision Tree Accuracy on test set: {dt_score:.2f}')
print('')

#Random Forest Score
rf_score = rf.score(X_train, y_train)
print(f'Random Forest Accuracy on train set: {rf_score:.2f}')

rf_score = rf.score(X_test, y_test)
print(f'Random Forest Accuracy on test set: {rf_score:.2f}')
print('')

#AdaBoost Score
adb_score = adb.score(X_train, y_train)
print(f'AdaBoost Accuracy on train set: {adb_score:.2f}')

adb_score = adb.score(X_test, y_test)
print(f'AdaBoost Accuracy on test set: {adb_score:.2f}')
print('')

#Bagging Score
bg_score = bg.score(X_train, y_train)
print(f'Bagging Accuracy on train set: {bg_score:.2f}')

bg_score = bg.score(X_test, y_test)
print(f'Bagging Accuracy on test set: {bg_score:.2f}')
print('')

#Gradient Boosting
gb_score = gb.score(X_train, y_train)
print(f'Gradient Boosting Accuracy on train set: {gb_score:.2f}')

gb_score = gb.score(X_test, y_test)
print(f'Gradient Boosting on test set: {gb_score:.2f}')
print('')


Decision Tree Accuracy on train set: 1.00
Decision Tree Accuracy on test set: 0.60

Random Forest Accuracy on train set: 0.99
Random Forest Accuracy on test set: 0.59

AdaBoost Accuracy on train set: 1.00
AdaBoost Accuracy on test set: 0.63

Bagging Accuracy on train set: 0.98
Bagging Accuracy on test set: 0.71

Gradient Boosting Accuracy on train set: 0.95
Gradient Boosting on test set: 0.62



### Cross Validation

Then we proceed to cross-validate. Cross-validation evaluates models on the unseen data in the test set, by resampling. In particular, cross-validation splits a given data sample at parameter k and proceeds to run iterations of the experiment on each unique group (it splits them again into training and test data, fits a model on the training, evaluates on the test set and retains the evaluation score - discarding the model). This summarises the skill of the model using the sample of model evaluation scores. 

In my experiment, I chose parameter k=10 as it is a common tactic that has been found to generally result in a model skill estimate with low bias and a modest variance, thus resulting in a 10-fold cross validation for each classifier. 

In [73]:
#Cross Validation
means = []
sds = []

#Decision Trees
dt_scores = cross_val_score(DecisionTreeClassifier(random_state = rs), X, y, cv=KFold(10))
#print(dt_scores)
print(f"Decision Tree Mean test accuracy: {np.mean(dt_scores):.3f}")
means += [dt_scores]

std = stdev(dt_scores)
sds += [std]
print(f"Decision Tree Standard Deviation: {std:.3f}")
print('')

#Random Forest
rf_scores = cross_val_score(RandomForestClassifier(random_state = rs), X, y, cv=KFold(10))
#print(rf_scores)
print(f"Random Forest Mean test accuracy: {np.mean(rf_scores):.3f}")
means += [rf_scores]


std = stdev(rf_scores)
sds += [std]
print(f"Random Forest Standard Deviation: {std:.3f}")
print('')

#AdaBoost
adb_scores = cross_val_score(AdaBoostClassifier(random_state = rs), X, y, cv=KFold(10))
#print(adb_scores)
print(f"AdaBoost Mean test accuracy: {np.mean(adb_scores):.3f}")
means += [adb_scores]

std = stdev(adb_scores)
sds += [std]
print(f"AdaBoost Standard Deviation: {std:.3f}")
print('')

#Bagging
bg_scores = cross_val_score(BaggingClassifier(random_state = rs), X, y, cv=KFold(10))
#print(bg_scores)
print(f"Bagging Mean test accuracy: {np.mean(bg_scores):.3f}")
means += [bg_scores]

std = stdev(bg_scores)
sds += [std]
print(f"Bagging Standard Deviation: {std:.3f}")
print('')

#Grdient Boosting
gb_scores = cross_val_score(GradientBoostingClassifier(random_state = rs), X, y, cv=KFold(10))
#print(bg_scores)
print(f"Gradient Boosting Mean test accuracy: {np.mean(gb_scores):.3f}")
means += [gb_scores]

std = stdev(gb_scores)
sds += [std]
print(f"Gradient Boosting Standard Deviation: {std:.3f}")
print('')

Decision Tree Mean test accuracy: 0.961
Decision Tree Standard Deviation: 0.022

Random Forest Mean test accuracy: 0.973
Random Forest Standard Deviation: 0.011

AdaBoost Mean test accuracy: 0.933
AdaBoost Standard Deviation: 0.006

Bagging Mean test accuracy: 0.969
Bagging Standard Deviation: 0.014

Gradient Boosting Mean test accuracy: 0.948
Gradient Boosting Standard Deviation: 0.007



### Autorank Report

Finally, I created a data frame comprised of the evaluation scores of each algorithm and their standard deviations to pass through autorank. Autorank (and in particular, the Frequentist approach) determines if the data are normal, the populations are homogenous (equal variances) and the number of populations and selects the appropriate statistical test, effect size and methods for determining the confidence interval of the central tendency from there.

The statistical test chosen by Autorank for the Arrhythmia analysis is the Friedman test. The Friedman test calculates the ranks of the algorithms in each cross-validation fold and then calculates the mean rank of the classifiers over the whole sample. From this it generates the test statistic.

In [63]:
#Autorank time
classifiers = ['Decision Tree', 'Random Forest', 'AdaBoost', 'Bagging', 'Gradient Boosting']

data = pd.DataFrame()
for i in range(5):
     data[classifiers[i]] = np.random.normal(means[i], sds[i], 10).clip(0, 1)

result_frequentist = autorank(data, alpha=0.05, verbose=False, approach='frequentist')
print(result_frequentist)

RankResult(rankdf=
                   meanrank      mean       std  ci_lower  ci_upper  \
Bagging                 2.2  0.720849  0.071371  0.647502  0.794196   
Gradient Boosting       2.4  0.720234  0.054841  0.663874  0.776594   
Random Forest           2.6  0.729677  0.119578  0.606788  0.852566   
AdaBoost                3.8  0.569198  0.178561  0.385693  0.752704   
Decision Tree           4.0  0.640950  0.089287   0.54919  0.732709   

                  effect_size   magnitude  
Bagging                     0  negligible  
Gradient Boosting  0.00966815  negligible  
Random Forest      -0.0896487  negligible  
AdaBoost              1.11529       large  
Decision Tree        0.988525       large  
pvalue=0.02440590052878717
cd=1.9288111473713958
omnibus=friedman
posthoc=nemenyi
all_normal=True
pvals_shapiro=[0.9583105444908142, 0.07404874265193939, 0.2385014146566391, 0.9609372019767761, 0.5190461874008179]
homoscedastic=False
pval_homogeneity=0.005881011843653574
homogeneity_test=b

According to the autorank report, the best algorithm for the Arrhythmia Dataset is the Bagging algorithm. A Bagging algorithm functions by splitting the data into multiple training sets upon which a class of learning or optimising methods such as decision trees and neural networks are applied. After training these multiple models on different samples of the same data, the prediction is averaged into a single summary with the reasoning that the averaging of misclassification errors on different data splits gives a better estimate of the predictive ability of a learning method.

## Caesarian Analysis

We perform the same experimental process to the Caesarian dataset as was done to the Arrhythmia dataset.

In [33]:
#CAESARIAN ANALYSIS
#Load Caesarian Dataset
df = pd.read_csv('caesarian.csv')
#print(df.head())

#Split dataset into features and target variable
X = df.loc[:, 'age':'heart-problem']
y = df.loc[:, 'class']

print(X.shape, y.shape)

(80, 5) (80,)


In [34]:
#Distinguish Training and Test Set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state = rs)
print(X_train.shape, X_test.shape)

(64, 5) (16, 5)


### Ensemble Algorithms

In [35]:
#Ensemble Algorithms

#Straight Decision Tree
dt = DecisionTreeClassifier(random_state = rs)
dt.fit(X_train, y_train)

#Random Forest
rf = RandomForestClassifier(n_estimators = 10, random_state = rs)
rf.fit(X_train, y_train)

#AdaBoost
adb = AdaBoostClassifier(DecisionTreeClassifier(random_state = rs), n_estimators = 10, learning_rate=1)
adb.fit(X_train, y_train)

#Bagging
bg = BaggingClassifier(DecisionTreeClassifier(random_state = rs), n_estimators = 10)
bg.fit(X_train, y_train)

#Gradient Boosting
gb = GradientBoostingClassifier(random_state = rs, n_estimators = 10)
gb.fit(X_train, y_train)

GradientBoostingClassifier(n_estimators=10, random_state=1234)

In [36]:
#Accuracy Scores

#Standard Decision Tree Score
dt_score = dt.score(X_train, y_train)
print(f'Decision Tree Accuracy on train set: {dt_score:.2f}')

dt_score = dt.score(X_test, y_test)
print(f'Decision Tree Accuracy on test set: {dt_score:.2f}')
print('')

#Random Forest Score
rf_score = rf.score(X_train, y_train)
print(f'Random Forest Accuracy on train set: {rf_score:.2f}')

rf_score = rf.score(X_test, y_test)
print(f'Random Forest Accuracy on test set: {rf_score:.2f}')
print('')

#AdaBoost Score
adb_score = adb.score(X_train, y_train)
print(f'AdaBoost Accuracy on train set: {adb_score:.2f}')

adb_score = adb.score(X_test, y_test)
print(f'AdaBoost Accuracy on test set: {adb_score:.2f}')
print('')

#Bagging Score
bg_score = bg.score(X_train, y_train)
print(f'Bagging Accuracy on train set: {bg_score:.2f}')

bg_score = bg.score(X_test, y_test)
print(f'Bagging Accuracy on test set: {bg_score:.2f}')
print('')

#Gradient Boosting
gb_score = gb.score(X_train, y_train)
print(f'Gradient Boosting Accuracy on train set: {gb_score:.2f}')

gb_score = gb.score(X_test, y_test)
print(f'Gradient Boosting on test set: {gb_score:.2f}')
print('')



Decision Tree Accuracy on train set: 0.97
Decision Tree Accuracy on test set: 0.50

Random Forest Accuracy on train set: 0.94
Random Forest Accuracy on test set: 0.38

AdaBoost Accuracy on train set: 0.97
AdaBoost Accuracy on test set: 0.38

Bagging Accuracy on train set: 0.94
Bagging Accuracy on test set: 0.38

Gradient Boosting Accuracy on train set: 0.86
Gradient Boosting on test set: 0.50



### Cross Validation

In [41]:
#Cross Validation
means = []
sds = []

#Decision Trees
dt_scores = cross_val_score(DecisionTreeClassifier(random_state = rs), X, y, cv=KFold(10))
#print(dt_scores)
print(f"Decision Tree Mean test accuracy: {np.mean(dt_scores):.3f}")
means += [dt_scores]

std = stdev(dt_scores)
sds += [std]
print(f"Decision Tree Standard Deviation: {std:.3f}")
print('')

#Random Forest
rf_scores = cross_val_score(RandomForestClassifier(random_state = rs), X, y, cv=KFold(10))
#print(rf_scores)
print(f"Random Forest Mean test accuracy: {np.mean(rf_scores):.3f}")
means += [rf_scores]


std = stdev(rf_scores)
sds += [std]
print(f"Random Forest Standard Deviation: {std:.3f}")
print('')

#AdaBoost
adb_scores = cross_val_score(AdaBoostClassifier(random_state = rs), X, y, cv=KFold(10))
#print(adb_scores)
print(f"AdaBoost Mean test accuracy: {np.mean(adb_scores):.3f}")
means += [adb_scores]

std = stdev(adb_scores)
sds += [std]
print(f"AdaBoost Standard Deviation: {std:.3f}")
print('')

#Bagging
bg_scores = cross_val_score(BaggingClassifier(random_state = rs), X, y, cv=KFold(10))
#print(bg_scores)
print(f"Bagging Mean test accuracy: {np.mean(bg_scores):.3f}")
means += [bg_scores]

std = stdev(bg_scores)
sds += [std]
print(f"Bagging Standard Deviation: {std:.3f}")
print('')

#Grdient Boosting
gb_scores = cross_val_score(GradientBoostingClassifier(random_state = rs), X, y, cv=KFold(10))
#print(bg_scores)
print(f"Gradient Boosting Mean test accuracy: {np.mean(gb_scores):.3f}")
means += [gb_scores]

std = stdev(gb_scores)
sds += [std]
print(f"Gradient Boosting Standard Deviation: {std:.3f}")
print('')



Decision Tree Mean test accuracy: 0.500
Decision Tree Standard Deviation: 0.132

Random Forest Mean test accuracy: 0.550
Random Forest Standard Deviation: 0.158

AdaBoost Mean test accuracy: 0.613
AdaBoost Standard Deviation: 0.224

Bagging Mean test accuracy: 0.550
Bagging Standard Deviation: 0.158

Gradient Boosting Mean test accuracy: 0.537
Gradient Boosting Standard Deviation: 0.167



### Autorank Report

The statistical test chosen by Autorank for the Caesarian analysis is the ANOVA test. The ANOVA test splits and tests observed variance data to gain insight into the relationship between dependent and independent variables.

In [55]:
#Autorank
classifiers = ['Decision Tree', 'Random Forest', 'AdaBoost', 'Bagging', 'Gradient Boosting']

data = pd.DataFrame()
for i in range(5):
     data[classifiers[i]] = np.random.normal(means[i], sds[i], 10).clip(0, 1)

result_frequentist = autorank(data, alpha=0.05, verbose=False, approach='frequentist')
print(result_frequentist)


RankResult(rankdf=
                   meanrank      mean       std  ci_lower  ci_upper  \
AdaBoost                2.2  0.654087  0.298911  0.495666  0.812508   
Bagging                 2.9  0.581198  0.280014  0.422777  0.739619   
Random Forest           3.2  0.528778  0.248934  0.370357  0.687199   
Gradient Boosting       3.3  0.535315  0.237468  0.376894  0.693736   
Decision Tree           3.4  0.499500  0.157168  0.341079  0.657921   

                  effect_size   magnitude  
AdaBoost                    0  negligible  
Bagging              0.251674       small  
Random Forest         0.45557       small  
Gradient Boosting    0.439988       small  
Decision Tree        0.647351      medium  
pvalue=0.4167498968974116
cd=None
omnibus=anova
posthoc=tukeyhsd
all_normal=True
pvals_shapiro=[0.7004509568214417, 0.37004631757736206, 0.3947213888168335, 0.5965405702590942, 0.8123753666877747]
homoscedastic=True
pval_homogeneity=0.4491624115731875
homogeneity_test=bartlett
alpha=0.05
a

  ax1.set_yticklabels(np.insert(self.groupsunique.astype(str), 0, ''))


The best algorithm for the Caesarian Dataset according to the mean rank of the Autorank report is the AdaBoost algorithm. Boosting trains each tree on a modified version of the original dataset. AdaBoost does this by assigning higher weights to wrongly classified observations as well as the trained classifiers according to accuracy, while it iteratively trains. This process iterates until complete training data fits without any error.

The characteristics of the Caesarian dataset that enabled the AdaBoost algorithm to outperform the others is due to the relatively small size and simplicity of the dataset since due to the nature of the algorithm, Boosting methods would falter for particularly noisy data.


## Website-Phishing Analysis

In [65]:
#Load Website Phishing Dataset
df = pd.read_csv('website-phishing.csv')
df.columns = df.columns.map(str.strip)
X = df.loc[:, 'having_IP_Address':'Statistical_report']
y = df.loc[:, 'Class']

print(X.shape, y.shape)


(11055, 30) (11055,)


### Ensemble Algorithms

In [66]:
#Ensemble Algorithms

#Straight Decision Tree
dt = DecisionTreeClassifier(random_state = rs)
dt.fit(X_train, y_train)

#Random Forest
rf = RandomForestClassifier(n_estimators = 10, random_state = rs)
rf.fit(X_train, y_train)

#AdaBoost
adb = AdaBoostClassifier(DecisionTreeClassifier(random_state = rs), n_estimators = 10, learning_rate=1)
adb.fit(X_train, y_train)

#Bagging
bg = BaggingClassifier(DecisionTreeClassifier(random_state = rs), n_estimators = 10)
bg.fit(X_train, y_train)

#Gradient Boosting
gb = GradientBoostingClassifier(random_state = rs, n_estimators = 10)
gb.fit(X_train, y_train)

GradientBoostingClassifier(n_estimators=10, random_state=1234)

In [67]:
#Accuracy Scores

#Standard Decision Tree Score
dt_score = dt.score(X_train, y_train)
print(f'Decision Tree Accuracy on train set: {dt_score:.2f}')

dt_score = dt.score(X_test, y_test)
print(f'Decision Tree Accuracy on test set: {dt_score:.2f}')
print('')

#Random Forest Score
rf_score = rf.score(X_train, y_train)
print(f'Random Forest Accuracy on train set: {rf_score:.2f}')

rf_score = rf.score(X_test, y_test)
print(f'Random Forest Accuracy on test set: {rf_score:.2f}')
print('')

#AdaBoost Score
adb_score = adb.score(X_train, y_train)
print(f'AdaBoost Accuracy on train set: {adb_score:.2f}')

adb_score = adb.score(X_test, y_test)
print(f'AdaBoost Accuracy on test set: {adb_score:.2f}')
print('')

#Bagging Score
bg_score = bg.score(X_train, y_train)
print(f'Bagging Accuracy on train set: {bg_score:.2f}')

bg_score = bg.score(X_test, y_test)
print(f'Bagging Accuracy on test set: {bg_score:.2f}')
print('')

#Gradient Boosting
gb_score = gb.score(X_train, y_train)
print(f'Gradient Boosting Accuracy on train set: {gb_score:.2f}')

gb_score = gb.score(X_test, y_test)
print(f'Gradient Boosting on test set: {gb_score:.2f}')
print('')


Decision Tree Accuracy on train set: 1.00
Decision Tree Accuracy on test set: 0.60

Random Forest Accuracy on train set: 0.99
Random Forest Accuracy on test set: 0.59

AdaBoost Accuracy on train set: 1.00
AdaBoost Accuracy on test set: 0.63

Bagging Accuracy on train set: 0.98
Bagging Accuracy on test set: 0.73

Gradient Boosting Accuracy on train set: 0.95
Gradient Boosting on test set: 0.62



### Cross Validation

In [68]:
#Cross Validation
means = []
sds = []

#Decision Trees
dt_scores = cross_val_score(DecisionTreeClassifier(random_state = rs), X, y, cv=KFold(10))
#print(dt_scores)
print(f"Decision Tree Mean test accuracy: {np.mean(dt_scores):.3f}")
means += [dt_scores]

std = stdev(dt_scores)
sds += [std]
print(f"Decision Tree Standard Deviation: {std:.3f}")
print('')

#Random Forest
rf_scores = cross_val_score(RandomForestClassifier(random_state = rs), X, y, cv=KFold(10))
#print(rf_scores)
print(f"Random Forest Mean test accuracy: {np.mean(rf_scores):.3f}")
means += [rf_scores]


std = stdev(rf_scores)
sds += [std]
print(f"Random Forest Standard Deviation: {std:.3f}")
print('')

#AdaBoost
adb_scores = cross_val_score(AdaBoostClassifier(random_state = rs), X, y, cv=KFold(10))
#print(adb_scores)
print(f"AdaBoost Mean test accuracy: {np.mean(adb_scores):.3f}")
means += [adb_scores]

std = stdev(adb_scores)
sds += [std]
print(f"AdaBoost Standard Deviation: {std:.3f}")
print('')

#Bagging
bg_scores = cross_val_score(BaggingClassifier(random_state = rs), X, y, cv=KFold(10))
#print(bg_scores)
print(f"Bagging Mean test accuracy: {np.mean(bg_scores):.3f}")
means += [bg_scores]

std = stdev(bg_scores)
sds += [std]
print(f"Bagging Standard Deviation: {std:.3f}")
print('')

#Grdient Boosting
gb_scores = cross_val_score(GradientBoostingClassifier(random_state = rs), X, y, cv=KFold(10))
#print(bg_scores)
print(f"Gradient Boosting Mean test accuracy: {np.mean(gb_scores):.3f}")
means += [gb_scores]

std = stdev(gb_scores)
sds += [std]
print(f"Gradient Boosting Standard Deviation: {std:.3f}")
print('')

Decision Tree Mean test accuracy: 0.961
Decision Tree Standard Deviation: 0.022

Random Forest Mean test accuracy: 0.973
Random Forest Standard Deviation: 0.011

AdaBoost Mean test accuracy: 0.933
AdaBoost Standard Deviation: 0.006

Bagging Mean test accuracy: 0.969
Bagging Standard Deviation: 0.014

Gradient Boosting Mean test accuracy: 0.948
Gradient Boosting Standard Deviation: 0.007



### Autorank Report

In [83]:
#Autorank time
classifiers = ['Decision Tree', 'Random Forest', 'AdaBoost', 'Bagging', 'Gradient Boosting']

data = pd.DataFrame()
for i in range(5):
     data[classifiers[i]] = np.random.normal(means[i], sds[i], 10).clip(0, 1)

result_frequentist = autorank(data, alpha=0.05, verbose=False, approach='frequentist')
print(result_frequentist)

RankResult(rankdf=
                   meanrank    median         mad  ci_lower ci_upper  \
Random Forest          2.45  0.998600  0.00207496  0.809945        1   
Gradient Boosting      2.90  0.982654   0.0257176  0.737841        1   
Decision Tree          3.00  0.975905   0.0357232  0.847895        1   
AdaBoost               3.10  0.917552   0.0982089   0.83122        1   
Bagging                3.55  0.957108   0.0635922  0.787674        1   

                  effect_size   magnitude  
Random Forest               0  negligible  
Gradient Boosting    0.874073       large  
Decision Tree        0.896957       large  
AdaBoost              1.16683       large  
Bagging              0.922259       large  
pvalue=0.5207728735201385
cd=1.9288111473713958
omnibus=friedman
posthoc=nemenyi
all_normal=False
pvals_shapiro=[0.022618934512138367, 0.0045369029976427555, 0.05054107680916786, 0.013056491501629353, 0.008861812762916088]
homoscedastic=True
pval_homogeneity=0.9493909315221907
homoge

For the Website-Phishing Dataset the best method from the Autorank report is the Random Forest algorithm. Random Forest is a Bagging algorithm that combines various decision trees to produce a more generalised model.  Individual decision trees are generated using a random selection of attributes at each node to determine the split and during classification, each tree votes with the most popular class returned.

Random forests are more robust to errors and outliers as the generalisation error for a forest converges as long as the number of trees in the forest is large and considers many fewer attributes for each split which makes it efficient for large databases. For this reason, the characteristics of the Website-Phishing dataset that enabled the Random Forest algorithm to outperform the others is due to the vast number of instances and number of attributes present in the multivariate dataset.


## Task 3: Iris Analysis

In [74]:
iris = sklearn.datasets.load_iris()
X = iris.data
y = iris.target

print(X.shape, y.shape)

(150, 4) (150,)


In [75]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=rs)
print(X_train.shape, X_test.shape)

(120, 4) (30, 4)


In [76]:
#Ensemble Algorithms

#Straight Decision Tree
dt = DecisionTreeClassifier(random_state = rs)
dt.fit(X_train, y_train)

#Random Forest
rf = RandomForestClassifier(n_estimators = 10, random_state = rs)
rf.fit(X_train, y_train)

#AdaBoost
adb = AdaBoostClassifier(DecisionTreeClassifier(random_state = rs), n_estimators = 10, learning_rate=1)
adb.fit(X_train, y_train)

#Bagging
bg = BaggingClassifier(DecisionTreeClassifier(random_state = rs), n_estimators = 10)
bg.fit(X_train, y_train)

#Gradient Boosting
gb = GradientBoostingClassifier(random_state = rs, n_estimators = 10)
gb.fit(X_train, y_train)

GradientBoostingClassifier(n_estimators=10, random_state=1234)

In [77]:
#Accuracy Scores

#Standard Decision Tree Score
dt_score = dt.score(X_train, y_train)
print(f'Decision Tree Accuracy on train set: {dt_score:.2f}')

dt_score = dt.score(X_test, y_test)
print(f'Decision Tree Accuracy on test set: {dt_score:.2f}')
print('')

#Random Forest Score
rf_score = rf.score(X_train, y_train)
print(f'Random Forest Accuracy on train set: {rf_score:.2f}')

rf_score = rf.score(X_test, y_test)
print(f'Random Forest Accuracy on test set: {rf_score:.2f}')
print('')

#AdaBoost Score
adb_score = adb.score(X_train, y_train)
print(f'AdaBoost Accuracy on train set: {adb_score:.2f}')

adb_score = adb.score(X_test, y_test)
print(f'AdaBoost Accuracy on test set: {adb_score:.2f}')
print('')

#Bagging Score
bg_score = bg.score(X_train, y_train)
print(f'Bagging Accuracy on train set: {bg_score:.2f}')

bg_score = bg.score(X_test, y_test)
print(f'Bagging Accuracy on test set: {bg_score:.2f}')
print('')

#Gradient Boosting
gb_score = gb.score(X_train, y_train)
print(f'Gradient Boosting Accuracy on train set: {gb_score:.2f}')

gb_score = gb.score(X_test, y_test)
print(f'Gradient Boosting on test set: {gb_score:.2f}')
print('')


Decision Tree Accuracy on train set: 1.00
Decision Tree Accuracy on test set: 1.00

Random Forest Accuracy on train set: 0.98
Random Forest Accuracy on test set: 1.00

AdaBoost Accuracy on train set: 1.00
AdaBoost Accuracy on test set: 1.00

Bagging Accuracy on train set: 0.99
Bagging Accuracy on test set: 1.00

Gradient Boosting Accuracy on train set: 0.99
Gradient Boosting on test set: 1.00



In [88]:
#Cross Validation
means = []
sds = []

#Decision Trees
dt_scores = cross_val_score(DecisionTreeClassifier(random_state = rs), X, y, cv=KFold(10))
#print(dt_scores)
print(f"Decision Tree Mean test accuracy: {np.mean(dt_scores):.3f}")
means += [dt_scores]

std = stdev(dt_scores)
sds += [std]
print(f"Decision Tree Standard Deviation: {std:.3f}")
print('')

#Random Forest
rf_scores = cross_val_score(RandomForestClassifier(random_state = rs), X, y, cv=KFold(10))
#print(rf_scores)
print(f"Random Forest Mean test accuracy: {np.mean(rf_scores):.3f}")
means += [rf_scores]


std = stdev(rf_scores)
sds += [std]
print(f"Random Forest Standard Deviation: {std:.3f}")
print('')

#AdaBoost
adb_scores = cross_val_score(AdaBoostClassifier(random_state = rs), X, y, cv=KFold(10))
#print(adb_scores)
print(f"AdaBoost Mean test accuracy: {np.mean(adb_scores):.3f}")
means += [adb_scores]

std = stdev(adb_scores)
sds += [std]
print(f"AdaBoost Standard Deviation: {std:.3f}")
print('')

#Bagging
bg_scores = cross_val_score(BaggingClassifier(random_state = rs), X, y, cv=KFold(10))
#print(bg_scores)
print(f"Bagging Mean test accuracy: {np.mean(bg_scores):.3f}")
means += [bg_scores]

std = stdev(bg_scores)
sds += [std]
print(f"Bagging Standard Deviation: {std:.3f}")
print('')

#Grdient Boosting
gb_scores = cross_val_score(GradientBoostingClassifier(random_state = rs), X, y, cv=KFold(10))
#print(bg_scores)
print(f"Gradient Boosting Mean test accuracy: {np.mean(gb_scores):.3f}")
means += [gb_scores]

std = stdev(gb_scores)
sds += [std]
print(f"Gradient Boosting Standard Deviation: {std:.3f}")
print('')

Decision Tree Mean test accuracy: 0.953
Decision Tree Standard Deviation: 0.055

Random Forest Mean test accuracy: 0.947
Random Forest Standard Deviation: 0.076

AdaBoost Mean test accuracy: 0.940
AdaBoost Standard Deviation: 0.073

Bagging Mean test accuracy: 0.933
Bagging Standard Deviation: 0.109

Gradient Boosting Mean test accuracy: 0.927
Gradient Boosting Standard Deviation: 0.102



In [89]:
#Autorank time
classifiers = ['Decision Tree', 'Random Forest', 'AdaBoost', 'Bagging', 'Gradient Boosting']

data = pd.DataFrame()
for i in range(5):
     data[classifiers[i]] = np.random.normal(means[i], sds[i], 10).clip(0, 1)

result_frequentist = autorank(data, alpha=0.05, verbose=False, approach='frequentist')
print(result_frequentist)

RankResult(rankdf=
                   meanrank    median        mad  ci_lower  ci_upper  \
Decision Tree          2.25  0.991074  0.0132332  0.859689         1   
Bagging                2.65  0.953300   0.069238  0.701802         1   
AdaBoost               2.95  0.922107  0.0613837  0.779098         1   
Gradient Boosting      3.45  0.991420  0.0127214  0.720478         1   
Random Forest          3.70  0.922444  0.0349123  0.752198  0.969677   

                  effect_size   magnitude  
Decision Tree               0  negligible  
Bagging              0.757846      medium  
AdaBoost              1.55325       large  
Gradient Boosting  -0.0265957  negligible  
Random Forest         2.59958       large  
pvalue=0.18697670748815112
cd=1.9288111473713958
omnibus=friedman
posthoc=nemenyi
all_normal=False
pvals_shapiro=[0.001630133599974215, 0.012223594821989536, 0.24799463152885437, 0.01963883824646473, 0.003663484239950776]
homoscedastic=True
pval_homogeneity=0.6230137720213405
homogen