## About IPython Notebooks ##

IPython Notebooks are interactive coding environments embedded in a webpage. You will be using IPython notebooks in this class. Make sure you fill in any place that says `# BEGIN CODE HERE #END CODE HERE`. After writing your code, you can run the cell by either pressing "SHIFT"+"ENTER" or by clicking on "Run" (denoted by a play symbol). Before you turn this problem in, make sure everything runs as expected. First, **restart the kernel** (in the menubar, select Kernel$\rightarrow$Restart) and then **run all cells** (in the menubar, select Cell$\rightarrow$Run All).

 **What you need to remember:**

- Run your cells using SHIFT+ENTER (or "Run cell")
- Write code in the designated areas using Python 3 only
- Do not modify the code outside of the designated areas
- In some cases you will also need to explain the results. There will also be designated areas for that. 

Fill in your **NAME** and **AEM** below:

In [1]:
NAME = "Christina Koutsou"
AEM = "9994 (ECE)"

---

# Assignment 3 - Ensemble Methods #

Welcome to your third assignment. This exercise will test your understanding of ensemble methods.

In [3]:
# Always run this cell
import numpy as np
import pandas as pd

# USE THE FOLLOWING RANDOM STATE FOR YOUR CODE
RANDOM_STATE = 42

## Download the Dataset ##
Download the dataset using the following cell or from this [link](https://github.com/sakrifor/public/tree/master/machine_learning_course/EnsembleDataset) and put the files in the same folder as the .ipynb file. 
In this assignment you are going to work with a dataset originating from the [ImageCLEFmed: The Medical Task 2016](https://www.imageclef.org/2016/medical) and its **Compound figure detection** subtask. The goal of this subtask is to identify whether a figure is a compound figure (one image consisting of more than one figure) or not. The train dataset consists of 4197 examples/figures, and each figure has 4096 features which were extracted using a deep neural network. The *CLASS* column represents the class of each example, where 1 is a compound figure and 0 is not.


In [5]:
import urllib.request
url_train = 'https://github.com/sakrifor/public/raw/master/machine_learning_course/EnsembleDataset/train_set.csv'
filename_train = 'train_set.csv'
urllib.request.urlretrieve(url_train, filename_train)
url_test = 'https://github.com/sakrifor/public/raw/master/machine_learning_course/EnsembleDataset/test_set_noclass.csv'
filename_test = 'test_set_noclass.csv'
urllib.request.urlretrieve(url_test, filename_test)

('test_set_noclass.csv', <http.client.HTTPMessage at 0x7f68b9bb28c0>)

In [17]:
# Run this cell to load the data
train_set = pd.read_csv("train_set.csv").sample(frac=1).reset_index(drop=True)
train_set.head()
X = train_set.drop(columns=['CLASS'])
y = train_set['CLASS'].values

In [None]:
!pip install -U imbalanced-learn

The following code undersamples the training set: it deals with the small class imbalance of the dataset and also reduces the size of the dataset!

In [18]:
from collections import Counter
from imblearn.under_sampling import NeighbourhoodCleaningRule, RandomUnderSampler

ncr = NeighbourhoodCleaningRule()
X_res, y_res = ncr.fit_resample(X, y)
rus = RandomUnderSampler(random_state=42)
X_res, y_res = rus.fit_resample(X_res, y_res)
print('Resampled dataset shape %s' % Counter(y_res))
X = X_res
y = y_res

Resampled dataset shape Counter({0: 1687, 1: 1687})
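
The balanced counts above come from the final random undersampling step. As a minimal sketch of that idea only (it does not reproduce `NeighbourhoodCleaningRule`, and it is not imbalanced-learn's implementation), the helper `random_undersample` and the toy labels below are hypothetical and use only the standard library:

```python
import random
from collections import Counter


def random_undersample(labels, seed=42):
    """Return sorted row indices that keep the same number of rows per class."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    # Every class is cut down to the size of the smallest class.
    n_keep = min(len(idxs) for idxs in by_class.values())
    kept = []
    for idxs in by_class.values():
        kept.extend(rng.sample(idxs, n_keep))
    return sorted(kept)


toy_labels = [0] * 7 + [1] * 4  # hypothetical imbalanced labels
kept_idx = random_undersample(toy_labels)
balanced = Counter(toy_labels[i] for i in kept_idx)
print(balanced)
```

Both classes end up with as many rows as the minority class had, which is why the two counts printed by the cell above are equal.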


## 1.0 Testing different ensemble methods ##
In this part of the assignment you are asked to create and test different ensemble methods using the train_set.csv dataset. You should use **5-fold cross validation** for your tests and report the average weighted f-measure and balanced accuracy of your models. You can use [cross_validate](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_validate.html#sklearn.model_selection.cross_validate) and select both metrics to be measured during the evaluation.
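
To make the two requested metrics concrete, here is a small hand-rolled sketch of how they are usually defined: balanced accuracy as the mean of the per-class recalls, and weighted F1 as the per-class F1 averaged with class supports as weights. The function names and toy labels are illustrative assumptions; in the assignment itself the scores come from `cross_validate`.

```python
from collections import Counter


def per_class_counts(y_true, y_pred, label):
    """True positives, false positives and false negatives for one class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    return tp, fp, fn


def balanced_accuracy(y_true, y_pred):
    """Mean recall over the classes present in y_true."""
    recalls = []
    for label in sorted(set(y_true)):
        tp, _, fn = per_class_counts(y_true, y_pred, label)
        recalls.append(tp / (tp + fn))
    return sum(recalls) / len(recalls)


def f1_weighted(y_true, y_pred):
    """Per-class F1 averaged with the class supports as weights."""
    total = 0.0
    for label, support in Counter(y_true).items():
        tp, fp, fn = per_class_counts(y_true, y_pred, label)
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        total += support * f1
    return total / len(y_true)


y_true = [1, 1, 1, 1, 0, 0]  # toy ground truth
y_pred = [1, 1, 1, 0, 0, 1]  # toy predictions
print(balanced_accuracy(y_true, y_pred))  # recalls 0.5 and 0.75 -> 0.625
print(f1_weighted(y_true, y_pred))        # (2 * 0.5 + 4 * 0.75) / 6
```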

### !!! Use n_jobs=-1 wherever possible to use all the cores of a machine when running your tests ###

### 1.1 Voting ###
Create a voting classifier which uses two **simple** estimators/classifiers. Test both soft and hard voting and report the results. Consider as simple estimators the following:


*   Decision Trees
*   Linear Models
*   KNN Models  
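
The difference between the two voting modes can be sketched without any library: hard voting counts each model's predicted label, while soft voting averages the predicted class probabilities before taking the argmax. The helpers `hard_vote` and `soft_vote` and the three probability vectors are hypothetical; ties in the hard vote are resolved here in favor of the lowest class label, which is an assumption of this sketch.

```python
def hard_vote(probas):
    """Majority vote over each model's predicted label (argmax of its probabilities)."""
    votes = [max(range(len(p)), key=p.__getitem__) for p in probas]
    # Ties go to the lowest class label, because max() keeps the first maximum.
    return max(sorted(set(votes)), key=votes.count)


def soft_vote(probas):
    """Argmax of the class probabilities averaged over all models."""
    n_classes = len(probas[0])
    avg = [sum(p[c] for p in probas) / len(probas) for c in range(n_classes)]
    return max(range(n_classes), key=avg.__getitem__)


# Hypothetical [P(class 0), P(class 1)] outputs of three models for one sample.
probas = [[0.45, 0.55], [0.45, 0.55], [0.95, 0.05]]
print(hard_vote(probas))  # two of three models predict class 1
print(soft_vote(probas))  # the single confident model pulls the average to class 0
```

With only two estimators, as in this exercise, a hard vote can only disagree one-to-one, so the tie-breaking rule decides those samples; soft voting uses the probability margins instead.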

In [19]:
### BEGIN SOLUTION

from sklearn.ensemble import VotingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

# USE RANDOM STATE!
cls1 = DecisionTreeClassifier(criterion='gini', max_depth=3, max_leaf_nodes=100, random_state=RANDOM_STATE) # Classifier #1
cls2 = LogisticRegression(max_iter = 800, random_state=RANDOM_STATE, n_jobs=-1) # Classifier #2, penalty is 'l2' by default, increased iterations limit to have better accuracy (but it takes longer to calculate and the difference isn't significant)
soft_vcls = VotingClassifier(estimators=[('cls1_soft',cls1),('cls2_soft',cls2)], voting='soft', n_jobs=-1)
hard_vcls = VotingClassifier(estimators=[('cls1_hard',cls1),('cls2_hard',cls2)], voting='hard', n_jobs=-1)

svlcs_scores = cross_validate(estimator=soft_vcls, X=X, y=y, scoring=['f1_weighted', 'balanced_accuracy'], n_jobs=-1) # 5-fold cross validation is used by default (cv=5)
s_avg_fmeasure = svlcs_scores['test_f1_weighted'].mean() # The average f-measure
s_avg_accuracy = svlcs_scores['test_balanced_accuracy'].mean() # The average balanced accuracy

hvlcs_scores = cross_validate(estimator=hard_vcls, X=X, y=y, scoring=['f1_weighted', 'balanced_accuracy'], n_jobs=-1)
h_avg_fmeasure = hvlcs_scores['test_f1_weighted'].mean() # The average f-measure
h_avg_accuracy = hvlcs_scores['test_balanced_accuracy'].mean() # The average balanced accuracy

### END SOLUTION

print("Classifier:")
print(soft_vcls)
print("F1 Weighted-Score: {} & Balanced Accuracy: {}".format(round(s_avg_fmeasure,4), round(s_avg_accuracy,4)))

print("Classifier:")
print(hard_vcls)
print("F1 Weighted-Score: {} & Balanced Accuracy: {}".format(round(h_avg_fmeasure,4), round(h_avg_accuracy,4)))

Classifier:
VotingClassifier(estimators=[('cls1_soft',
                              DecisionTreeClassifier(max_depth=3,
                                                     max_leaf_nodes=100,
                                                     random_state=42)),
                             ('cls2_soft',
                              LogisticRegression(max_iter=800, n_jobs=-1,
                                                 random_state=42))],
                 n_jobs=-1, voting='soft')
F1 Weighted-Score: 0.8897 & Balanced Accuracy: 0.8898
Classifier:
VotingClassifier(estimators=[('cls1_hard',
                              DecisionTreeClassifier(max_depth=3,
                                                     max_leaf_nodes=100,
                                                     random_state=42)),
                             ('cls2_hard',
                              LogisticRegression(max_iter=800, n_jobs=-1,
                                                 random_state=42))],

The soft voting classifier should score above 0.74 on both the weighted F1 score and balanced accuracy; the hard voting classifier should score above 0.79 on the weighted F1 score and above 0.80 on balanced accuracy. Remember! This should be the average performance over the folds, as measured through cross-validation with 5 folds!

### 1.2 Randomization

You are asked to create three ensembles of decision trees where each one uses a different method for producing homogeneous ensembles. Compare them with a simple decision tree classifier and report your results in the dictionaries (dict) below using as key the given name of your classifier and as value the f1_weighted/balanced_accuracy score. The dictionaries should contain four different elements. Use the same cross-validation approach as before! 
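
One common way to make the members of a homogeneous ensemble differ is bootstrap sampling, which bagging-style methods rely on. The sketch below (hypothetical helper `bootstrap_indices`, standard library only) draws one bootstrap sample and shows that it contains roughly 63% of the distinct rows, leaving the rest "out of bag"; it illustrates the sampling idea only, not how scikit-learn implements it.

```python
import random


def bootstrap_indices(n_samples, rng):
    """Draw n_samples row indices uniformly with replacement."""
    return [rng.randrange(n_samples) for _ in range(n_samples)]


rng = random.Random(42)
n_samples = 10_000  # hypothetical dataset size
sample = bootstrap_indices(n_samples, rng)

in_bag = set(sample)                          # rows this ensemble member trains on
out_of_bag = set(range(n_samples)) - in_bag   # rows it never sees
unique_fraction = len(in_bag) / n_samples
print(round(unique_fraction, 2))  # close to 1 - 1/e, i.e. about 0.63
```

Because every member sees a different subset of rows, the trees disagree on some samples, and averaging their votes reduces variance compared with a single tree.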

In [20]:
### BEGIN SOLUTION
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, BaggingClassifier

ens1 = RandomForestClassifier(max_depth=10, max_leaf_nodes=100, random_state=RANDOM_STATE, n_jobs=-1) # same parameters as the simple decision tree classifier
# so that the comparison is fair; a heavily tuned base estimator with many parameters is not needed to compare them
ens2 = AdaBoostClassifier(estimator=DecisionTreeClassifier(max_depth=10, max_leaf_nodes=100, random_state=RANDOM_STATE), random_state=RANDOM_STATE)
ens3 = BaggingClassifier(estimator=DecisionTreeClassifier(max_depth=10, max_leaf_nodes=100, random_state=RANDOM_STATE), random_state=RANDOM_STATE, n_jobs=-1)
tree = DecisionTreeClassifier(criterion='gini', max_depth=10, max_leaf_nodes=100, random_state=RANDOM_STATE)

ens1_scores = cross_validate(estimator=ens1, X=X, y=y, scoring=['f1_weighted', 'balanced_accuracy'], n_jobs=-1) # 5-fold cross validation is used by default (cv=5)
ens2_scores = cross_validate(estimator=ens2, X=X, y=y, scoring=['f1_weighted', 'balanced_accuracy'], n_jobs=-1)
ens3_scores = cross_validate(estimator=ens3, X=X, y=y, scoring=['f1_weighted', 'balanced_accuracy'], n_jobs=-1)
tree_scores = cross_validate(estimator=tree, X=X, y=y, scoring=['f1_weighted', 'balanced_accuracy'], n_jobs=-1)

f_measures = {'Ensemble with Random Forest classifier': ens1_scores['test_f1_weighted'].mean(),
              'Ensemble with Ada Boost classifier': ens2_scores['test_f1_weighted'].mean(),
              'Ensemble with Bagging classifier': ens3_scores['test_f1_weighted'].mean(),
              'Simple Decision': tree_scores['test_f1_weighted'].mean()
              }
accuracies = {'Ensemble with Random Forest classifier': ens1_scores['test_balanced_accuracy'].mean(),
              'Ensemble with Ada Boost classifier': ens2_scores['test_balanced_accuracy'].mean(),
              'Ensemble with Bagging classifier': ens3_scores['test_balanced_accuracy'].mean(),
              'Simple Decision': tree_scores['test_balanced_accuracy'].mean()
              }
# Example f_measures = {'Simple Decision':0.8551, 'Ensemble with random ...': 0.92, ...}


### END SOLUTION

print(ens1)
print(ens2)
print(ens3)
print(tree)
for name,score in f_measures.items():
    print("Classifier: {} -  F1 Weighted: {}".format(name,round(score,4)))
for name,score in accuracies.items():
    print("Classifier: {} -  BalancedAccuracy: {}".format(name,round(score,4)))

RandomForestClassifier(max_depth=10, max_leaf_nodes=100, n_jobs=-1,
                       random_state=42)
AdaBoostClassifier(estimator=DecisionTreeClassifier(max_depth=10,
                                                    max_leaf_nodes=100,
                                                    random_state=42),
                   random_state=42)
BaggingClassifier(estimator=DecisionTreeClassifier(max_depth=10,
                                                   max_leaf_nodes=100,
                                                   random_state=42),
                  n_jobs=-1, random_state=42)
DecisionTreeClassifier(max_depth=10, max_leaf_nodes=100, random_state=42)
Classifier: Ensemble with Random Forest classifier -  F1 Weighted: 0.85
Classifier: Ensemble with Ada Boost classifier -  F1 Weighted: 0.8476
Classifier: Ensemble with Bagging classifier -  F1 Weighted: 0.8321
Classifier: Simple Decision -  F1 Weighted: 0.754
Classifier: Ensemble with Random Forest classifier -  BalancedA

### 1.3 Question

Increasing the number of estimators in a bagging classifier can drastically increase the training time of a classifier. Is there any solution to this problem? Can the same solution be applied to boosting classifiers?

The most efficient way to reduce the training time is to parallelize the sampling of the dataset and the fitting of the individual decision trees, since each tree is independent of the others. However, this cannot be applied to boosting classifiers, because at each iteration the sample weights change and influence the samples chosen for the next iteration. Each estimator's training therefore depends on the previous estimators' performance and sample weights, which makes the procedure sequential, so it cannot be parallelized. Once the boosting process is complete and all the estimators (weak models) are trained, however, they can be used independently to make predictions on new instances in parallel.
Other ways to reduce the training time of a bagging classifier are to limit the number of samples or the depth of each estimator tree, trading some accuracy for time. However, this would probably not be optimal or save much time. In contrast, restricting the training to a subset of the original set by passing it as input to `cross_validate` could yield better results.
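
The sequential dependency in boosting can be seen in a single reweighting step. The sketch below uses the classic discrete AdaBoost update, which is an assumption of this example and not necessarily the exact formula scikit-learn's `AdaBoostClassifier` applies; `adaboost_reweight` and the toy inputs are hypothetical.

```python
import math


def adaboost_reweight(weights, misclassified):
    """One discrete AdaBoost round: return the new sample weights and the model weight alpha."""
    err = sum(w for w, miss in zip(weights, misclassified) if miss)
    alpha = 0.5 * math.log((1 - err) / err)
    # Misclassified samples are up-weighted, correctly classified ones down-weighted.
    updated = [w * math.exp(alpha if miss else -alpha)
               for w, miss in zip(weights, misclassified)]
    total = sum(updated)
    return [w / total for w in updated], alpha


weights = [0.2] * 5                                # uniform start over five samples
misclassified = [False, False, True, False, False]  # round 1 misses sample 2
new_weights, alpha = adaboost_reweight(weights, misclassified)
print(new_weights)  # the missed sample now carries half of the total weight
```

The second estimator cannot start until `new_weights` exists, and `new_weights` requires the first estimator's predictions. That dependency is what prevents training the rounds in parallel, whereas bagging members only need their own independent sample.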

## 2.0 Creating the best classifier ##
In the second part of this assignment, we will try to train the best classifier, as well as to evaluate it using stratified cross validation.

### 2.1 Good Performing Ensemble

In this part of the assignment you are asked to train a good performing ensemble, that is able to be used in a production environment! Describe the process you followed to achieve this result. How did you choose your classifier and your parameters and why. Report the f-measure (weighted) & balanced accuracy, using 10-fold stratified cross validation, of your final classifier. Can you achieve a balanced accuracy over 88%, while keeping the training time low? (Tip 1: You can even use a model from the previous parts, but you are advised to test additional configurations, and ensemble architectures, Tip 2: If you try a lot of models/ensembles/configurations or even grid searches, in your answer leave only the classifier you selected as the best!)
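
Stratification means every fold keeps the class proportions of the full dataset. A minimal standard-library sketch of that idea follows; the helper `stratified_folds` and the toy labels are hypothetical, and scikit-learn's `StratifiedKFold` should be used for the actual evaluation.

```python
import random
from collections import Counter, defaultdict


def stratified_folds(labels, n_splits, seed=42):
    """Assign a fold id to every sample so each fold keeps the class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    fold_of = [None] * len(labels)
    for idxs in by_class.values():
        rng.shuffle(idxs)
        # Deal the shuffled samples of each class round-robin across the folds.
        for position, idx in enumerate(idxs):
            fold_of[idx] = position % n_splits
    return fold_of


labels = [0] * 20 + [1] * 10  # toy labels with a 2:1 class ratio
fold_of = stratified_folds(labels, n_splits=5)
per_fold = [Counter(labels[i] for i, f in enumerate(fold_of) if f == k)
            for k in range(5)]
print(per_fold)  # every fold holds four samples of class 0 and two of class 1
```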

In [24]:
### BEGIN SOLUTION
from sklearn.model_selection import StratifiedKFold, GridSearchCV
from sklearn.ensemble import StackingClassifier
from sklearn.neighbors import KNeighborsClassifier

est1 = KNeighborsClassifier(n_jobs=-1)
est2 = RandomForestClassifier(max_depth=25, max_leaf_nodes=300, n_jobs=-1, random_state=RANDOM_STATE, bootstrap=False) # parameters pointed out by a grid search, default values are not specified

best_cls = StackingClassifier(estimators=[('KNeighbors', est1), ('Random Forest', est2)], n_jobs=-1)

str_folds = StratifiedKFold(n_splits=10,shuffle=True, random_state=RANDOM_STATE)

best_cls_scores = cross_validate(estimator=best_cls, X=X, y=y, scoring=['f1_weighted', 'balanced_accuracy'], n_jobs=-1, cv=str_folds, return_estimator=True)

best_model_index = best_cls_scores['test_balanced_accuracy'].argmax()
best_model = best_cls_scores['estimator'][best_model_index] # best model

# To report the performance of the best model:
# best_fmeasure =  best_cls_scores['test_f1_weighted'].max()
# best_accuracy =  best_cls_scores['test_balanced_accuracy'].max() # highest accuracy

best_fmeasure =  best_cls_scores['test_f1_weighted'].mean()
best_accuracy =  best_cls_scores['test_balanced_accuracy'].mean() # mean balanced accuracy over the 10 folds

### END SOLUTION

print("Classifier:")
print(best_cls)
print("F1 Weighted-Score: {} & Balanced Accuracy: {}".format(best_fmeasure, best_accuracy))

Classifier:
StackingClassifier(estimators=[('KNeighbors', KNeighborsClassifier(n_jobs=-1)),
                               ('Random Forest',
                                RandomForestClassifier(bootstrap=False,
                                                       max_depth=25,
                                                       max_leaf_nodes=300,
                                                       n_jobs=-1,
                                                       random_state=42))],
                   n_jobs=-1)
F1 Weighted-Score: 0.8900151148519637 & Balanced Accuracy: 0.8900288813750352


Read part 2.2 first for better understanding. The grid search for Random Forest indicated better training accuracy for `bootstrap=True`, but the accuracy of the predictions for the testing set was higher with `bootstrap=False`, so the latter was used instead (the difference is probably due to slight overfitting). Parameters whose chosen values match the defaults are omitted from the function call. To compare the ensembles, the `mean()` of the fold scores was used. The best model is identified in this cell so that its scores can be printed using the commented-out lines.

### 2.2 Question
 What other ensemble architectures did you try, and why did you not choose them as your final classifier?

From part 1, I observed that the Random Forest classifier produces the highest accuracy and can also be parallelized, which makes it time efficient. However, even after a grid search over some parameters of the estimator trees, the balanced accuracy score could not exceed 0.87. Afterwards, I tried AdaBoost instead, but it still could not reach that level of accuracy.
My next thought was to use a combination of methods. Since AdaBoost and Random Forest could only improve the model up to a certain point and performed very similarly in part 1.2, their combination would probably not increase the accuracy much further (this was confirmed through testing as well). So, another estimator should be used along with Random Forest, which is preferred over AdaBoost because it allows its estimators to be built in parallel. Following the [suggestions of scikit-learn](https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html) and trying LinearSVC, KNeighbors and SVC, the accuracy increased along the way, reaching an average value of 0.9 in the case of SVC. However, training with SVC took a significant amount of time (around 7 minutes). LinearSVC took less time, though still a considerable amount, while KNeighbors allows parallelization and produced better accuracy than LinearSVC, though lower than SVC. Since we want to achieve an accuracy of over 88% and keep the training time as low as possible, the trade-off between KNeighbors and SVC favored the former. Other estimators that were considered and tried out are SGD and K-Means (the latter is usually used for clustering but was configured for classification in this case).

Parameter grid search for Random Forest:
```
param_grid = {
    "min_samples_leaf": [1, 10, 20, 50],
    "max_leaf_nodes": [10, 50, 100, 150, 300],
    "bootstrap": [True, False],
    "max_features": ['sqrt', 'log2']  # 'log2' is the valid option name; 'log' is rejected by scikit-learn
}

grid_search = GridSearchCV(est2, param_grid, cv=10, n_jobs=-1) # replace est2 with a Random Forest classifier, use 10-fold cross validation
grid_search.fit(X, y)

print("Best params:")
print(grid_search.best_params_)
```
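
The cost of such a search grows multiplicatively with the grid. As a rough sketch, the grid above can be enumerated with `itertools.product` to count the model fits it implies; `"log2"` is assumed as the second `max_features` option, and the count excludes the final refit on the full data that `GridSearchCV` performs by default.

```python
from itertools import product

param_grid = {
    "min_samples_leaf": [1, 10, 20, 50],
    "max_leaf_nodes": [10, 50, 100, 150, 300],
    "bootstrap": [True, False],
    "max_features": ["sqrt", "log2"],
}

# One candidate per combination of values, in the order the keys are listed.
names = list(param_grid)
candidates = [dict(zip(names, values)) for values in product(*param_grid.values())]

n_folds = 10
total_fits = len(candidates) * n_folds
print(len(candidates), total_fits)  # 4 * 5 * 2 * 2 = 80 candidates, 800 fits
```

Adding one more parameter with three values would triple both numbers, which is why the grid is kept to the few parameters expected to matter most.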

### 2.3 Setup the Final Classifier
Finally, in this last cell, set the cls variable either to the best model produced by the stratified cross-validation, or retrain your classifier on the whole dataset (X, y). There is no correct answer, but try to explain your choice. Then, save your model using pickle and upload it with your submission to e-learning!
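
Saving and reloading with pickle is a plain round trip through a binary file. The sketch below uses a hypothetical dictionary `model_stub` as a stand-in for the fitted classifier and a temporary directory so that nothing is left on disk; opening the files with `with` ensures they are closed after use.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for a fitted model; any picklable object behaves the same way.
model_stub = {"name": "stacking", "estimators": ["KNeighbors", "Random Forest"]}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "best_model.pkl")
    with open(path, "wb") as fh:   # binary write mode is required by pickle
        pickle.dump(model_stub, fh)
    with open(path, "rb") as fh:   # binary read mode to load it back
        restored = pickle.load(fh)

print(restored == model_stub)
```

A pickled scikit-learn model generally needs to be loaded with the same library version that saved it, so it is worth noting the version alongside the file.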

In [27]:
import pickle

### BEGIN SOLUTION
cls = best_model # the best-scoring model from cross_validate is already fitted (on the training folds of its split),
# which is intended to limit overfitting

# otherwise:
# cls = best_cls
# cls.fit(X,y)

# save with pickle
file_name = "best_model.pkl"
with open(file_name, "wb") as f:
    pickle.dump(cls, f)
### END SOLUTION


# load
with open(file_name, "rb") as f:
    cls = pickle.load(f)

test_set = pd.read_csv("test_set_noclass.csv")
predictions = cls.predict(test_set)

# We are going to run the following code
if False:
    from sklearn.metrics import f1_score, balanced_accuracy_score
    final_test_set = pd.read_csv('test_set_noclass.csv')
    ground_truth = final_test_set['CLASS']
    print("Balanced Accuracy: {}".format(balanced_accuracy_score(ground_truth, predictions)))  # scikit-learn expects (y_true, y_pred)
    print("F1 Weighted-Score: {}".format(f1_score(ground_truth, predictions, average='weighted')))

Both metrics should be above 82%! We are going to test this ourselves. Make sure your cross-validated or retrained model achieves a high balanced accuracy and f1_score (more than 88%, based on 2.1), as it should achieve at least 82% on our unknown test set!


Please provide your feedback regarding this project! Did you enjoy it? 

In [26]:
# YOUR ANSWER HERE