<a href="https://colab.research.google.com/github/kpetridis24/machine-learning-intro/blob/main/EnsembleMethods.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## About iPython Notebooks ##

iPython Notebooks are interactive coding environments embedded in a webpage. You will be using iPython notebooks in this class. Make sure you fill in any place that says `# BEGIN CODE HERE #END CODE HERE`. After writing your code, you can run the cell by either pressing "SHIFT"+"ENTER" or by clicking on "Run" (denoted by a play symbol). Before you turn this problem in, make sure everything runs as expected. First, **restart the kernel** (in the menubar, select Kernel$\rightarrow$Restart) and then **run all cells** (in the menubar, select Cell$\rightarrow$Run All). 

 **What you need to remember:**

- Run your cells using SHIFT+ENTER (or "Run cell")
- Write code in the designated areas using Python 3 only
- Do not modify the code outside of the designated areas
- In some cases you will also need to explain the results. There will also be designated areas for that. 

Fill in your **NAME** and **AEM** below:

In [3]:
NAME = "Konstantinos Petridis"
AEM = "9403"

---

# Assignment 3 - Ensemble Methods #

Welcome to your third assignment. This exercise will test your understanding of Ensemble Methods.

In [4]:
# Always run this cell
import numpy as np
import pandas as pd

# USE THE FOLLOWING RANDOM STATE FOR YOUR CODE
RANDOM_STATE = 42

## Download the Dataset ##
Download the dataset using the following cell or from this [link](https://github.com/sakrifor/public/tree/master/machine_learning_course/EnsembleDataset) and put the files in the same folder as the .ipynb file. 
In this assignment you are going to work with a dataset originating from the [ImageCLEFmed: The Medical Task 2016](https://www.imageclef.org/2016/medical) and its **Compound figure detection** subtask. The goal of this subtask is to identify whether a figure is a compound figure (one image consisting of more than one figure) or not. The train dataset consists of 4197 examples/figures, and each figure has 4096 features which were extracted using a deep neural network. The *CLASS* column represents the class of each example, where 1 is a compound figure and 0 is not. 


In [5]:
import urllib.request
url_train = 'https://github.com/sakrifor/public/raw/master/machine_learning_course/EnsembleDataset/train_set.csv'
filename_train = 'train_set.csv'
urllib.request.urlretrieve(url_train, filename_train)
url_test = 'https://github.com/sakrifor/public/raw/master/machine_learning_course/EnsembleDataset/test_set_noclass.csv'
filename_test = 'test_set_noclass.csv'
urllib.request.urlretrieve(url_test, filename_test)

('test_set_noclass.csv', <http.client.HTTPMessage at 0x7fa9697b35d0>)

In [6]:
# Run this cell to load the data
train_set = pd.read_csv("train_set.csv").sample(frac=1).reset_index(drop=True)
train_set.head()
X = train_set.drop(columns=['CLASS'])
y = train_set['CLASS'].values

## 1.0 Testing different ensemble methods ##
In this part of the assignment you are asked to create and test different ensemble methods using the train_set.csv dataset. You should use **10-fold cross validation** for your tests and report the average f-measure weighted and balanced accuracy of your models. You can use [cross_validate](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_validate.html#sklearn.model_selection.cross_validate) and select both metrics to be measured during the evaluation. Otherwise, you can use [KFold](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html#sklearn.model_selection.KFold).
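
As a minimal sketch of the evaluation protocol described above, the cell below runs 10-fold cross-validation with both metrics in a single `cross_validate` call. The synthetic data from `make_classification`, the names `X_demo`/`y_demo`, and the choice of a decision tree are assumptions for illustration only; in the assignment, the `X` and `y` loaded from `train_set.csv` would be passed instead.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the assignment data (assumption, not the real dataset).
X_demo, y_demo = make_classification(n_samples=300, n_features=20, random_state=42)

# 10 stratified folds, shuffled with a fixed seed so the run is repeatable.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)

# Request both metrics at once; results are stored under "test_<metric name>".
cv_results = cross_validate(
    DecisionTreeClassifier(random_state=42),
    X_demo,
    y_demo,
    cv=cv,
    scoring=["f1_weighted", "balanced_accuracy"],
)

mean_f1 = cv_results["test_f1_weighted"].mean()
mean_bal_acc = cv_results["test_balanced_accuracy"].mean()
print(f"F1 weighted: {mean_f1:.4f}  balanced accuracy: {mean_bal_acc:.4f}")
```

`cross_validate` also accepts `n_jobs=-1` to evaluate the folds in parallel.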

### !!! Use n_jobs=-1 wherever possible to use all the cores of a machine for running your tests ###

In [7]:
from sklearn.datasets import load_iris, make_regression, load_digits, fetch_california_housing, make_hastie_10_2
from sklearn.ensemble import BaggingClassifier, StackingClassifier, RandomForestRegressor, \
 BaggingRegressor, VotingRegressor, AdaBoostRegressor, GradientBoostingClassifier, VotingClassifier, \
 RandomForestClassifier, AdaBoostClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import mean_squared_error, accuracy_score, f1_score, balanced_accuracy_score
import time
import matplotlib.pyplot as plt

### 1.1 Voting ###
Create a voting classifier which uses three **simple** estimators/classifiers. Test both soft and hard voting and choose the best one. Consider as simple estimators the following:


*   Decision Trees
*   Linear Models
*   Probabilistic Models (Naive Bayes)
*   KNN Models  
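
To see why soft and hard voting can disagree, here is a small worked example in plain NumPy. The probabilities are made up for illustration (they do not come from the assignment data): two classifiers lean weakly towards class 1, while the third is confident in class 0.

```python
import numpy as np

# Hypothetical predicted probabilities for ONE sample from three classifiers.
# Columns are [P(class 0), P(class 1)].
probas = np.array([
    [0.45, 0.55],  # classifier A leans slightly towards class 1
    [0.45, 0.55],  # classifier B leans slightly towards class 1
    [0.90, 0.10],  # classifier C is confident in class 0
])

# Hard voting: every classifier casts one vote for its most likely class.
votes = probas.argmax(axis=1)
hard_winner = int(np.bincount(votes, minlength=2).argmax())

# Soft voting: average the probabilities, then pick the most likely class.
mean_proba = probas.mean(axis=0)
soft_winner = int(mean_proba.argmax())

print("hard:", hard_winner, "soft:", soft_winner)  # hard: 1 soft: 0
```

Hard voting follows the 2-to-1 majority, whereas soft voting lets the one confident classifier outweigh the two uncertain ones. Soft voting therefore needs estimators that expose `predict_proba`.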

In [8]:
# BEGIN CODE HERE
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=RANDOM_STATE)

cls1 = DecisionTreeClassifier(random_state=RANDOM_STATE) # Classifier #1 
cls2 = LogisticRegression(random_state=RANDOM_STATE, n_jobs=-1) # Classifier #2 
cls3 = KNeighborsClassifier(n_jobs=-1) # Classifier #3
soft_vcls = VotingClassifier([('DTC', cls1), ('LRC', cls2), ('KNC', cls3)], voting="soft", n_jobs=-1) # Voting Classifier
hard_vcls = VotingClassifier([('DTC', cls1), ('LRC', cls2), ('KNC', cls3)], voting="hard", n_jobs=-1) # Voting Classifier

soft_vcls.fit(X_train, y_train)
hard_vcls.fit(X_train, y_train)

soft_pred = soft_vcls.predict(X_test)
hard_pred = hard_vcls.predict(X_test)

svlcs_scores = ""
s_avg_fmeasure = f1_score(soft_pred, y_test, average="weighted") # Weighted F1 on the hold-out split
s_avg_accuracy = accuracy_score(soft_pred, y_test) # Plain (not balanced) accuracy on the hold-out split

hvlcs_scores = ""
h_avg_fmeasure = f1_score(hard_pred, y_test, average="weighted") # Weighted F1 on the hold-out split
h_avg_accuracy = accuracy_score(hard_pred, y_test) # Plain (not balanced) accuracy on the hold-out split
#END CODE HERE

In [9]:
print("Classifier:")
print(soft_vcls)
print("F1 Weighted-Score: {} & Balanced Accuracy: {}".format(round(s_avg_fmeasure,4), round(s_avg_accuracy,4)))

Classifier:
VotingClassifier(estimators=[('DTC', DecisionTreeClassifier(random_state=42)),
                             ('LRC',
                              LogisticRegression(n_jobs=-1, random_state=42)),
                             ('KNC', KNeighborsClassifier(n_jobs=-1))],
                 n_jobs=-1, voting='soft')
F1 Weighted-Score: 0.8317 & Balanced Accuracy: 0.831


You should achieve above 82% (Soft Voting Classifier)

In [10]:
print("Classifier:")
print(hard_vcls)
print("F1 Weighted-Score: {} & Balanced Accuracy: {}".format(round(h_avg_fmeasure,4), round(h_avg_accuracy,4)))

Classifier:
VotingClassifier(estimators=[('DTC', DecisionTreeClassifier(random_state=42)),
                             ('LRC',
                              LogisticRegression(n_jobs=-1, random_state=42)),
                             ('KNC', KNeighborsClassifier(n_jobs=-1))],
                 n_jobs=-1)
F1 Weighted-Score: 0.835 & Balanced Accuracy: 0.8345


You should achieve above 80% in both! (Hard Voting Classifier)

### 1.2 Stacking ###
Create a stacking classifier which uses two more complex estimators. Try different simple classifiers (like the ones mentioned before) for the combination of the initial estimators. Report your results in the following cell.

Consider as complex estimators the following:

*   Random Forest
*   SVM
*   Gradient Boosting
*   MLP
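
The layout described above (complex base estimators, combined by a simple final estimator) could be sketched as follows. This is an illustration on synthetic data, not the reference solution: the data, the `_demo` names, the reduced `n_estimators=50`, and the choice of Random Forest plus Gradient Boosting with a Logistic Regression combiner are all assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    GradientBoostingClassifier,
    RandomForestClassifier,
    StackingClassifier,
)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the assignment data (assumption).
X_demo, y_demo = make_classification(n_samples=400, n_features=20, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42, stratify=y_demo
)

# Two complex base estimators produce the first-level predictions.
base_estimators = [
    ("rf", RandomForestClassifier(n_estimators=50, random_state=42)),
    ("gb", GradientBoostingClassifier(n_estimators=50, random_state=42)),
]

# A simple model learns how to combine them. Internally, cv=5 makes the
# final estimator train on out-of-fold predictions of the base estimators.
stack = StackingClassifier(
    estimators=base_estimators,
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,
)
stack.fit(X_tr, y_tr)
stack_pred = stack.predict(X_te)
```

Swapping `final_estimator` for another simple classifier (for example a shallow decision tree or KNN) is the only change needed to compare combiners.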




In [11]:
# BEGIN CODE HERE

cls1 = DecisionTreeClassifier(random_state=RANDOM_STATE, max_depth=5) # Classifier #1 
cls2 = KNeighborsClassifier(n_jobs=-1) # Classifier #2 
cls3 = LogisticRegression(random_state=RANDOM_STATE, n_jobs=-1) # Classifier #3 (Optional)

classifiers = [('DTC', cls1),('KNC', cls2), ('SVM', cls3)]

meta_classifier = StackingClassifier([
                                      ('RFC', RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=RANDOM_STATE)),
                                      ('GBC', GradientBoostingClassifier(n_estimators=100, random_state=RANDOM_STATE))], 
                                      LogisticRegression(random_state=RANDOM_STATE), cv=10)

scls = StackingClassifier(classifiers, meta_classifier, n_jobs=-1) # Stacking Classifier

scls.fit(X_train, y_train)
scls_pred = scls.predict(X_test)

avg_fmeasure = f1_score(y_test, scls_pred, average="weighted") # Weighted F1 on the hold-out split (y_true first)
avg_accuracy = balanced_accuracy_score(y_test, scls_pred) # Balanced accuracy on the hold-out split (y_true first)
#END CODE HERE

In [12]:
print("Classifier:")
print(scls)
print("F1 Weighted Score: {} & Balanced Accuracy: {}".format(round(avg_fmeasure,4), round(avg_accuracy,4)))

Classifier:
StackingClassifier(estimators=[('DTC',
                                DecisionTreeClassifier(max_depth=5,
                                                       random_state=42)),
                               ('KNC', KNeighborsClassifier(n_jobs=-1)),
                               ('SVM',
                                LogisticRegression(n_jobs=-1,
                                                   random_state=42))],
                   final_estimator=StackingClassifier(cv=10,
                                                      estimators=[('RFC',
                                                                   RandomForestClassifier(n_jobs=-1,
                                                                                          random_state=42)),
                                                                  ('GBC',
                                                                   GradientBoostingClassifier(random_state=42))],
                             

You should achieve above 85% in both

## 2.0 Randomization ##

**2.1** You are asked to create three ensembles of decision trees where each one uses a different method for producing homogeneous ensembles. Compare them with a simple decision tree classifier and report your results in the dictionaries (dict) below using as key the given name of your classifier and as value the f1_weighted/balanced_accuracy score. The dictionaries should contain four different elements.  

In [None]:
# BEGIN CODE HERE
ens1 = BaggingClassifier(DecisionTreeClassifier(random_state=RANDOM_STATE), n_estimators=150, n_jobs=-1, random_state=RANDOM_STATE)
ens2 = GradientBoostingClassifier(n_estimators=150, subsample=0.75, random_state=RANDOM_STATE)
ens3 = AdaBoostClassifier(RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=RANDOM_STATE), n_estimators=150, random_state=RANDOM_STATE)
tree = DecisionTreeClassifier(random_state=RANDOM_STATE)

f_measures = dict()
accuracies = dict()

titles = ["Bagging with Decision Tree", "Gradient Boosting",
          "AdaBoost with Random Forest", "Simple Tree Classifier"]

models = [ens1, ens2, ens3, tree]

for title, model in zip(titles, models):
  model.fit(X_train, y_train)
  y_pred = model.predict(X_test)
  f_measures[title] = f1_score(y_test, y_pred, average="weighted")
  accuracies[title] = balanced_accuracy_score(y_test, y_pred)


# Example f_measures = {'Simple Decision': 0.8551, 'Ensemble with random ...': 0.92, ...}

#END CODE HERE

In [None]:
print(ens1)
print(ens2)
print(ens3)
print(tree)
for name,score in f_measures.items():
    print("Classifier:{} -  F1 Weighted:{}".format(name,round(score,4)))
for name,score in accuracies.items():
    print("Classifier:{} -  BalancedAccuracy:{}".format(name,round(score,4)))

**2.2** Describe your classifiers and your results.

ANSWER 

**Classifier 1**

Bagging with decision trees. The ensemble trains `n_estimators` decision trees, each on a bootstrap sample of the training set (drawn at random with replacement), so every tree sees a different subset of the examples. The predictions of the trees are then aggregated. Combining many such trees reduces variance, which is why bagging is considered effective against overfitting.
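
As a small illustration of the sampling step described above, the sketch below draws one bootstrap sample with NumPy and measures how many distinct examples it contains. The sample size of 1000 is an arbitrary choice for the demonstration.

```python
import numpy as np

rng = np.random.default_rng(42)
n_samples = 1000

# One bootstrap sample: n_samples row indices drawn WITH replacement.
indices = rng.integers(0, n_samples, size=n_samples)

# Some rows are drawn several times and others not at all.
unique_fraction = np.unique(indices).size / n_samples
print(f"distinct rows in the bootstrap sample: {unique_fraction:.1%}")
```

For large datasets the expected fraction of distinct rows approaches 1 - 1/e, about 63%, so each tree is trained on a noticeably different view of the data.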

**Classifier 2**

Gradient boosting with subsampling. The model builds `n_estimators` trees sequentially, and each tree is fitted on a random 75% of the training examples (`subsample=0.75`). Each new tree is trained to correct the errors that remain after the previous trees, so the fit of the ensemble improves gradually as trees are added.
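
The idea of correcting the remaining errors can be sketched with a hand-written boosting loop. Note the assumptions: this is a simplified regression version with squared loss, where the errors are plain residuals, whereas `GradientBoostingClassifier` fits its trees to gradients of a classification loss. The toy data, learning rate, and number of rounds are illustrative choices.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(42)
X_demo = rng.uniform(-3, 3, size=(200, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate, subsample, n_rounds = 0.1, 0.75, 30

# Start from a constant prediction and track the training error per round.
prediction = np.full_like(y_demo, y_demo.mean())
mse_history = [np.mean((y_demo - prediction) ** 2)]

for _ in range(n_rounds):
    residuals = y_demo - prediction  # what the ensemble still gets wrong
    # Stochastic variant: each tree sees a random 75% of the rows.
    rows = rng.choice(len(y_demo), size=int(subsample * len(y_demo)), replace=False)
    tree = DecisionTreeRegressor(max_depth=2, random_state=42)
    tree.fit(X_demo[rows], residuals[rows])
    # Add a damped correction to the running prediction.
    prediction += learning_rate * tree.predict(X_demo)
    mse_history.append(np.mean((y_demo - prediction) ** 2))

print(f"training MSE: {mse_history[0]:.3f} -> {mse_history[-1]:.3f}")
```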

**Classifier 3**

Adaptive boosting (AdaBoost) with a Random Forest as the base estimator. Each new model puts more emphasis on the examples that the previous model did not predict correctly. This is done through per-example weights, which are updated after every round according to the errors of the current model.
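
The reweighting step can be made concrete with one round of the textbook binary AdaBoost update. This is a sketch of the classic formula, not necessarily the exact variant that scikit-learn implements; the five examples and the single misclassification are invented for the demonstration.

```python
import numpy as np

# Five training examples with uniform starting weights.
weights = np.full(5, 0.2)

# Suppose the current weak learner gets only the third example wrong.
misclassified = np.array([False, False, True, False, False])

# Weighted error of the learner and its resulting vote strength.
error = weights[misclassified].sum()        # 0.2
alpha = 0.5 * np.log((1 - error) / error)   # 0.5 * ln(4) = ln(2)

# Raise the weights of the mistakes, lower the rest, then renormalise.
signs = np.where(misclassified, 1.0, -1.0)
new_weights = weights * np.exp(alpha * signs)
new_weights /= new_weights.sum()

print(new_weights)  # [0.125 0.125 0.5 0.125 0.125]
```

After the update the misclassified example carries half of the total weight, so the next learner is pushed to get it right.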

**Classifier 4**

A simple Decision Tree Classifier

**Results**

The results match our expectations and agree with the theory. On both evaluation metrics, the ensembles score significantly higher than the single decision tree. 


**2.3** Increasing the number of estimators in a bagging classifier can drastically increase the training time of a classifier. Is there any solution to this problem? Can the same solution be applied to boosting classifiers?

YOUR ANSWER HERE
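
One commonly used remedy is sketched below, under the assumption that this is the kind of solution the question has in mind: the members of a bagging ensemble are independent of one another, so scikit-learn can fit them in parallel through `n_jobs`. Boosting fits its members one after another, each depending on the previous one, and `AdaBoostClassifier` exposes no `n_jobs` parameter. The synthetic data and the ensemble size are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the assignment data (assumption).
X_demo, y_demo = make_classification(n_samples=500, n_features=20, random_state=42)

# Bagging: the 50 trees are independent, so they can be fitted on all cores.
bag = BaggingClassifier(
    DecisionTreeClassifier(random_state=42),
    n_estimators=50,
    n_jobs=-1,
    random_state=42,
)
bag.fit(X_demo, y_demo)

# Inspect which ensembles offer a parallel-training switch at all.
bagging_has_n_jobs = "n_jobs" in BaggingClassifier().get_params()
boosting_has_n_jobs = "n_jobs" in AdaBoostClassifier().get_params()
print(bagging_has_n_jobs, boosting_has_n_jobs)  # True False
```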

## 3.0 Creating the best classifier ##

**3.1** In this part of the assignment you are asked to train the best possible ensemble! Describe the process you followed to achieve this result. How did you choose your classifier and your parameters and why. Report the f-measure (weighted) & balanced accuracy (10-fold cross validation) of your final classifier and results of classifiers you tried in the cell following the code. Can you achieve a balanced accuracy over 83-84%?

In [None]:
# BEGIN CODE HERE
from xgboost import XGBClassifier
clf1 = SVC(random_state=RANDOM_STATE, probability=True)
clf2 = AdaBoostClassifier(RandomForestClassifier(n_estimators=100), random_state=RANDOM_STATE)
clf3 = XGBClassifier(n_estimators=100, tree_method='gpu_hist')  # 'gpu_hist' needs a CUDA-capable GPU runtime
clf4 = BaggingClassifier(DecisionTreeClassifier(random_state=RANDOM_STATE), n_estimators=100, n_jobs=-1, random_state=RANDOM_STATE)

classifiers = [('1', clf1), ('2', clf2), ('3', clf3), ('4', clf4)]
final_clf1 = StackingClassifier(classifiers, LogisticRegression(n_jobs=-1), n_jobs=-1)
final_clf2 = StackingClassifier(classifiers, RandomForestClassifier(n_estimators=100, n_jobs=-1), n_jobs=-1)
final_clf3 = VotingClassifier(classifiers, voting="soft", n_jobs=-1)

names = ["Stacking with Logistic Regression", "Stacking with Random Forest", "Soft Voting"]
ensembles = [final_clf1, final_clf2, final_clf3]

f_measures = dict()
accuracies = dict()
scores = dict()

for name, ensemble in zip(names, ensembles):
  print(name)
  ensemble.fit(X_train, y_train)
  y_pred = ensemble.predict(X_test)
  f_measures[name] = f1_score(y_test, y_pred, average="weighted")
  accuracies[name] = balanced_accuracy_score(y_test, y_pred)

# Pick the ensemble with the highest F1 and report its scores (not its name)
best_name = max(f_measures, key=f_measures.get)
best_fmeasure = round(f_measures[best_name], 4)
best_accuracy = round(accuracies[best_name], 4)
#END CODE HERE

In [None]:
print("Classifier:")
# print(best_cls)
print("F1 Weighted-Score:{} & Balanced Accuracy:{}".format(best_fmeasure, best_accuracy))

**3.2** Describe the process you followed to achieve this result. How did you choose your classifier and your parameters and why. Report the f-measure & accuracy (10-fold cross validation) of your final classifier and results of classifiers you tried in the cell following the code.

YOUR ANSWER HERE

To reach the target accuracy, several classifiers were combined through trial and error. The guiding idea is that a good ensemble combines models with different characteristics, so that it benefits from the strengths of each model while each model compensates for the weaknesses of the others. 

A **Random Forest Classifier**, which already gives satisfactory accuracy on its own, is boosted with **adaptive boosting** to improve it further. A **Gradient Boosting Classifier** (XGBoost) and an **SVM** are also used as base estimators. The fourth base estimator is a **Bagging Classifier** of decision trees, chosen to counter possible **overfitting** introduced by the other models.

The ensembles used for the experiments use the previous 4 models:

**Ensemble 1** 

Stacking using **Logistic Regression** as a meta-estimator and the previous 4 models as base-estimators.

**Ensemble 2** 

Stacking using **Random Forest** as a meta-estimator and the previous 4 models as base-estimators.

**Ensemble 3** 

Soft voting over the results of the 4 base-estimators.

**3.3** Create a classifier that is going to be used in production - in a live system. Use the *test_set_noclass.csv* to make predictions. Store the predictions in a list.  
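
A minimal sketch of the workflow described above: fit the chosen model on all labelled data, predict on the unlabelled frame, and store the result as a plain Python list. Everything here is a stand-in: the synthetic frames, the column names, and the Random Forest (`demo_clf`) are assumptions, not the classifier this assignment expects.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Build a labelled frame and an unlabelled frame with identical feature columns.
features, labels = make_classification(n_samples=300, n_features=10, random_state=42)
columns = [f"f{i}" for i in range(features.shape[1])]
labeled = pd.DataFrame(features[:250], columns=columns)
labeled["CLASS"] = labels[:250]
unlabeled = pd.DataFrame(features[250:], columns=columns)

# For a live system, train on every labelled example that is available.
demo_clf = RandomForestClassifier(n_estimators=50, random_state=42)
demo_clf.fit(labeled.drop(columns=["CLASS"]), labeled["CLASS"])

# predict returns a NumPy array; tolist() converts it to a Python list.
demo_predictions = demo_clf.predict(unlabeled[columns]).tolist()
print(len(demo_predictions), demo_predictions[:5])
```

Selecting `unlabeled[columns]` keeps the feature order identical to the order used during training.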

In [None]:
# BEGIN CODE HERE
cls = ...
#END CODE HERE
test_set = pd.read_csv("test_set_noclass.csv")
predictions = cls.predict(test_set)

LEAVE HERE ANY COMMENTS ABOUT YOUR CLASSIFIER

#### The following cell will not be executed. The test_set.csv with the classes will be made available after the deadline and this cell is for testing purposes!!! Do not modify it! ###

In [None]:
if False:
  from sklearn.metrics import f1_score, balanced_accuracy_score
  final_test_set = pd.read_csv('test_set.csv')
  ground_truth = final_test_set['CLASS']
  print("Balanced Accuracy: {}".format(balanced_accuracy_score(predictions, ground_truth)))
  print("F1 Weighted-Score: {}".format(f1_score(predictions, ground_truth, average='weighted')))

Both should aim above 85%!