# Statistical Data Analysis: Machine Learning Assignment

- Statistical Data Analysis (SPA6328)
- Academic Year: 2020-2021
- Module Organiser: Dr Seth Zenz
- Module Associate: Prof Adrian Bevan

Copyright (C) Queen Mary University of London

## This assessment is for summative feedback.

In this assignment you will analyse the iris data using decision-tree-based classifiers, specifically the AdaBoost boosted decision tree.  Each decision tree by itself is considered a weak learner, and AdaBoost computes its output from a weighted combination of many trees. This weighted averaging leads to a more robust machine learning algorithm: its performance on new, unseen data should be similar to that seen in training and testing, and overtraining issues are reduced.  The process also results in what is called a strong learner, one that separates the different categories of event better than a single tree, or indeed any of the individual features that are input into that tree.
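
The weighting idea can be sketched in a few lines of plain Python. This is a minimal illustration of one boosting round in the multi-class SAMME formulation of AdaBoost, not the scikit-learn implementation; the function names `samme_round` and `weighted_vote` are chosen here purely for illustration.

```python
import math

def samme_round(weights, is_wrong, n_classes):
    """One SAMME boosting round: return (alpha, updated example weights).

    Assumes the weighted error lies strictly between 0 and 1.
    """
    total = sum(weights)
    # Weighted error of the current weak learner.
    err = sum(w for w, wrong in zip(weights, is_wrong) if wrong) / total
    # Weight of this learner in the final vote; positive when the learner
    # beats random guessing, i.e. err < 1 - 1/n_classes.
    alpha = math.log((1.0 - err) / err) + math.log(n_classes - 1.0)
    # Misclassified examples are up-weighted so the next tree focuses on them.
    new = [w * math.exp(alpha) if wrong else w
           for w, wrong in zip(weights, is_wrong)]
    norm = sum(new)
    return alpha, [w / norm for w in new]

def weighted_vote(predictions, alphas, n_classes):
    """Combine per-tree class predictions into one label by alpha-weighted vote."""
    votes = [0.0] * n_classes
    for label, alpha in zip(predictions, alphas):
        votes[label] += alpha
    return votes.index(max(votes))

# Four equally weighted examples, the last one misclassified, three classes.
alpha, new_weights = samme_round([0.25] * 4, [False, False, False, True], n_classes=3)
print(round(alpha, 3), [round(w, 3) for w in new_weights])
```

After the round, the single misclassified example carries most of the total weight, which is what pushes the next tree to concentrate on it.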

By now you should be very familiar with the iris data, both in terms of the 1D and 2D information, and what you can learn from the 1D distributions in terms of the ability to separate the three different types of iris from each other.  Here we take the next step to use a machine learning algorithm to simultaneously benefit from the 4-dimensional feature space to separate signal from background.

## Task

Train a classifier using the iris data and study the performance characteristics of this classifier in detail by working through this notebook.
You should:
- Add your name and student ID to this solution.
- Work through the Iris data decision tree classification example in order to answer the following questions
  - Using a train split of 0.5, explore the effect of (a) changing the number of estimators, and (b) changing the tree depth, on the performance of the classifier. For this exercise tabulate results for including 10, 100, 500 and 1000 estimators (i.e. boosting iterations) and for tree depths of 1, 2, 3.  Measure performance by the fraction of mis-classified test examples.
  - Repeat the above using a train split of 0.8.
  - What is the configuration that leads to the least number of mis-classified examples?
  - Why do you think, in detail, that any residual example(s) are mis-classified by the algorithm? If there are no mis-classified examples, does that concern you with regard to the use of this algorithm and the sample sizes used for train and test? You may wish to reflect on the earlier formative assignments and the notes to guide your response to this question.
  - Reflect on the results you obtained for the train split size. Remark on any differences in performance that you observe.


 * Name:        Henry Atkins
 * Student ID:  180196054

---------------------------
## Solution

Table exploring the effect of (a) changing the number of estimators, and (b) changing the tree depth, on the performance of the classifier. For this exercise tabulate results for including 10, 100, 500 and 1000 estimators (i.e. boosting iterations) and for tree depths of 1, 2, 3.  Measure performance by the fraction of mis-classified test examples.


----------------
Train split = 0.5 (entries are the number of mis-classified test examples)

| Tree Depth  | 10 iter  | 100 iter  | 500 iter  | 1000 iter |
| ----------- | -------- | --------- | --------- | --------- |
| 1           | 6.0      | 6.0       | 6.0       | 6.0       |
| 2           | 3.0      | 3.0       | 3.0       | 3.0       |
| 3           | 3.0      | 3.0       | 3.0       | 3.0       |

----------------
Train split = 0.8 (entries are the number of mis-classified test examples)

| Tree Depth  | 10 iter  | 100 iter  | 500 iter  | 1000 iter |
| ----------- | -------- | --------- | --------- | --------- |
| 1           | 1.0      | 3.0       | 3.0       | 3.0       |
| 2           | 0.0      | 1.0       | 1.0       | 0.0       |
| 3           | 0.0      | 1.0       | 1.0       | 1.0       |
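
The table entries above appear to be counts of mis-classified test examples, while the notebook cells further down report percentages. A minimal sketch of the conversion, assuming the 150-example iris data set and assuming the test set is whatever remains once the training fraction has been taken; the helper names are illustrative only.

```python
N_IRIS = 150  # size of the iris data set

def n_test_examples(train_split, n_total=N_IRIS):
    """Number of examples left for testing after the train split."""
    return n_total - int(train_split * n_total)

def misclassified_count(percent, train_split, n_total=N_IRIS):
    """Convert a mis-classified percentage into a whole number of test examples."""
    return round(percent / 100.0 * n_test_examples(train_split, n_total))

print(n_test_examples(0.5), n_test_examples(0.8))
print(misclassified_count(8.0, 0.5))    # 8% of the 0.5-split test set
print(misclassified_count(3.33, 0.8))   # 3.33% of the 0.8-split test set
```

With only 30 test examples in the 0.8 split, a single mis-classified example already moves the percentage by 3.33 points.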


From the results obtained, state the least number of mis-classified examples:

   min(N-misclassified) = 0, obtained with a train split of 0.8 for 10 iterations at tree depths 2 and 3, and for 1000 iterations at tree depth 2.

Reflect on this outcome

   See Conclusion section below



In [1]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import confusion_matrix
import seaborn as sns

iris = load_iris()

def misclassified(train_split_size, number_of_estimators, tree_depth):
    """Return the percentage of mis-classified test examples."""
    # random_state=0 fixes the train/test split only; the classifiers are not seeded.
    X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, random_state=0, train_size=train_split_size)
    DT_clf  = DecisionTreeClassifier(max_depth=tree_depth, min_samples_leaf=1)
    # The base tree is passed positionally: the keyword was base_estimator in
    # older scikit-learn releases and is estimator in newer ones.
    BDT_clf = AdaBoostClassifier(DT_clf, n_estimators=number_of_estimators).fit(X_train, y_train)
    test_score  = BDT_clf.score(X_test, y_test)
    incorrect = round((1-test_score)*100, 2)
    return incorrect

def compare(train_split_size, number_of_estimators, tree_depth):
    """Return (train error %, test error %) for one configuration."""
    X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, random_state=0, train_size=train_split_size)
    DT_clf  = DecisionTreeClassifier(max_depth=tree_depth, min_samples_leaf=1)
    # Base tree passed positionally for the same reason as in misclassified().
    BDT_clf = AdaBoostClassifier(DT_clf, n_estimators=number_of_estimators).fit(X_train, y_train)
    train_score = BDT_clf.score(X_train, y_train)
    test_score  = BDT_clf.score(X_test, y_test)
    return (round((1-train_score)*100, 3), round((1-test_score)*100, 3))

np.set_printoptions(suppress=True)

In [2]:
print("Misclassified percentage for 0.5 training weightings:" + "\n")
array50 = np.array([["TD", "10", "100", "500", "1000"], 
                 ["1", misclassified(0.5, 10, 1), misclassified(0.5, 100, 1), misclassified(0.5, 500, 1), misclassified(0.5, 1000, 1)], 
                 ["2", misclassified(0.5, 10, 2), misclassified(0.5, 100, 2), misclassified(0.5, 500, 2), misclassified(0.5, 1000, 2)], 
                 ["3", misclassified(0.5, 10, 3), misclassified(0.5, 100, 3), misclassified(0.5, 500, 3), misclassified(0.5, 1000, 3)]])
print(array50)

Misclassified percentage for 0.5 training weightings:

[['TD' '10' '100' '500' '1000']
 ['1' '8.0' '8.0' '8.0' '8.0']
 ['2' '4.0' '4.0' '4.0' '4.0']
 ['3' '4.0' '4.0' '4.0' '4.0']]


In [3]:
print("Misclassified percentage for 0.8 training weightings:" + "\n")
array80 = np.array([["TD", "10", "100", "500", "1000"], 
                 ["1", misclassified(0.8, 10, 1), misclassified(0.8, 100, 1), misclassified(0.8, 500, 1), misclassified(0.8, 1000, 1)], 
                 ["2", misclassified(0.8, 10, 2), misclassified(0.8, 100, 2), misclassified(0.8, 500, 2), misclassified(0.8, 1000, 2)], 
                 ["3", misclassified(0.8, 10, 3), misclassified(0.8, 100, 3), misclassified(0.8, 500, 3), misclassified(0.8, 1000, 3)]])
print(array80)

Misclassified percentage for 0.8 training weightings:

[['TD' '10' '100' '500' '1000']
 ['1' '3.33' '10.0' '10.0' '10.0']
 ['2' '0.0' '3.33' '3.33' '0.0']
 ['3' '0.0' '0.0' '3.33' '3.33']]
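
The two cells above spell out every grid point by hand. The same table could be built with a loop; the sketch below is one way to do it in plain Python. `tabulate_grid` and `dummy_score` are hypothetical names, and `dummy_score` is only a stand-in so the example runs on its own. In the notebook one could pass `lambda n, d: misclassified(0.8, n, d)` instead.

```python
def tabulate_grid(score_fn, estimators=(10, 100, 500, 1000), depths=(1, 2, 3)):
    """Build a markdown table of score_fn(n_estimators, depth) over the grid."""
    header = "| Tree Depth | " + " | ".join(f"{n} iter" for n in estimators) + " |"
    rule = "| --- | " + " | ".join("---" for _ in estimators) + " |"
    rows = [header, rule]
    for depth in depths:
        cells = " | ".join(str(score_fn(n, depth)) for n in estimators)
        rows.append(f"| {depth} | {cells} |")
    return "\n".join(rows)

def dummy_score(n_estimators, depth):
    """Placeholder scorer used only to show the table layout."""
    return float(depth)

table = tabulate_grid(dummy_score)
print(table)
```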


## Conclusion ## 

The minimum number of mis-classified examples lies in the table for the 0.8 train split. There are zero mis-classified examples for:
(Iterations, Tree Depth) = (10, 2), (10, 3), (1000, 2).
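
Zero observed errors on a small test set does not imply a zero true error rate. As a rough check, the sketch below treats each test example as an independent binomial trial, which is an assumption made here rather than something stated in the notebook; the function names are illustrative.

```python
import math

def binomial_std_error(n_wrong, n_test):
    """Binomial standard error on the observed mis-classified fraction."""
    p = n_wrong / n_test
    return math.sqrt(p * (1.0 - p) / n_test)

def upper_limit_zero_errors(n_test, confidence=0.95):
    """Upper limit on the true error rate when no test errors are observed.

    Solves (1 - p)**n_test = 1 - confidence for p.
    """
    return 1.0 - (1.0 - confidence) ** (1.0 / n_test)

# One error in 30 test examples, and the limit for zero errors in 30.
print(round(binomial_std_error(1, 30), 3))
print(round(upper_limit_zero_errors(30), 3))
```

With 30 test examples, one mis-classification (3.3%) carries a standard error of about 3.3 percentage points, and zero observed errors is still compatible with a true error rate of roughly 9.5% at 95% confidence. Differences of a single example between configurations are therefore within the statistical noise.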

This is due to the large amount of training data the algorithm receives, and the many trees. The configurations with more than 100 iterations (i.e. 500 and 1000) show overfitting to the training set: they score highly on the training set but less well on the test set, as seen below. There the training score is 100%, but one test example is mis-classified in each case. There is also a further point with zero mis-classified examples, at (1000, 2), which appears to be another optimum where the balance of tree depth and iterations allows accurate fitting without overfitting.


These results show some promise: a misclassification rate of 3.3% is good. However, there is evidence of overfitting, so highly accurate parameter combinations should be considered critically. Please note that there is an element of randomness here: when the notebook is re-run, (500, 2) and (100, 3) sometimes show zero test errors. Several values for the 0.8 train split oscillate between runs, which indicates some random behaviour. The "random_state=0" parameter only fixes the train/test split; the classifiers themselves are not seeded, and scikit-learn randomly permutes the features when searching for tree splits, so ties between equally good splits can be broken differently on each run. The analysis still holds, but some of the specific numbers may differ from those quoted; most oscillate between 0.00% and 3.33%.
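
One way to test where the run-to-run variation comes from would be to seed the classifiers as well as the split. The sketch below assumes scikit-learn is installed; `misclassified_seeded` and its `seed` parameter are introduced here for illustration and are not part of the original notebook. The base tree is passed positionally so the call does not depend on the keyword name used by the installed release.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()

def misclassified_seeded(train_split_size, number_of_estimators, tree_depth, seed=0):
    """Percentage of mis-classified test examples with every random source seeded."""
    X_train, X_test, y_train, y_test = train_test_split(
        iris.data, iris.target, random_state=0, train_size=train_split_size)
    # Seed both the base tree and the boosting procedure.
    tree = DecisionTreeClassifier(max_depth=tree_depth, random_state=seed)
    clf = AdaBoostClassifier(tree, n_estimators=number_of_estimators, random_state=seed)
    clf.fit(X_train, y_train)
    return round((1 - clf.score(X_test, y_test)) * 100, 2)

# Two runs of the same configuration should now agree.
first = misclassified_seeded(0.8, 10, 2)
second = misclassified_seeded(0.8, 10, 2)
print(first, second, first == second)
```

If the seeded results stop changing between runs, the variation comes from the classifiers rather than from the split. Repeating the scan over several values of `seed` would then show how much each table entry moves.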

In [4]:
print("Over-Fitting Proof: (Train Errors, Test Errors)")
print("0.8, 500, 3:       ", compare(0.8, 500, 3))
print("0.8, 1000, 3:      ", compare(0.8, 1000, 3))
print("Overfitting if Train Errors << Test Errors")

Over-Fitting Proof: (Train Errors, Test Errors)
0.8, 500, 3:        (0.0, 3.333)
0.8, 1000, 3:       (0.0, 3.333)
Overfitting if Train Errors << Test Errors
