# Lab assignment: building ensembles

<img src="https://albarji-labs-materials.s3-eu-west-1.amazonaws.com/ensemble.jpg"/>

 <div align="right">
  Photo credit: <a href=https://www.flickr.com/photos/buffo400/44243474840>Heinz Bunse - The french ensemble nevermind on stage of the Strawinsky Saal Donaueschingen</a>
</div> 

In this assignment we will start from some base classifiers and build ensembles of them to improve performance. Finally, we will compare these ensembles with tree-specific versions, so as to see which model performs best on a battery of classification datasets.

## Guidelines

Throughout this notebook you will find empty cells that you will need to fill with your own code. Follow the instructions in the notebook and pay special attention to the following symbols.

<img src="https://albarji-labs-materials.s3-eu-west-1.amazonaws.com/question.png" height="80" width="80" style="float: right;"/>

***

<font color=#ad3e26>
You will need to solve a question by writing your own code or answer in the cell immediately below or in a different file, as instructed.</font>

***

<img src="https://albarji-labs-materials.s3-eu-west-1.amazonaws.com/exclamation.png" height="80" width="80" style="float: right;"/>

***
<font color=#2655ad>
This is a hint or useful observation that can help you solve this assignment. You should pay attention to these hints to better understand the assignment.
</font>

***

<img src="https://albarji-labs-materials.s3-eu-west-1.amazonaws.com/pro.png" height="80" width="80" style="float: right;"/>

***
<font color=#259b4c>
This is an advanced exercise that can help you gain a deeper knowledge of the topic. Good luck!</font>

***

To avoid missing packages and compatibility issues you should run this notebook under one of the [recommended Ensembles environment files](https://github.com/albarji/teaching-environments-ensembles).

Lastly, if you need any help on the usage of a Python function you can place the writing cursor over its name and press Shift+Tab to produce a pop-out with related documentation. This will only work inside code cells. 

Let's go!

## Initialization

Let's start with some code to configure matplotlib and fix the random seed:

In [None]:
import numpy as np

np.random.seed(42)
%matplotlib inline

## Data loading

In this assignment we will work with 13 datasets located under the *data* folder. Let's take a look at them:

In [None]:
from os import listdir

dataset_files = listdir('./data/')
dataset_files

Each of the 13 datasets is represented as two files: one for the training data (`.train`) and another one for the test data (`.test`).

To begin, we will prepare some code to load the datasets. All of them follow the same format, so it makes sense to define some functions for this task. But let's start with the minimum block: how to load a single file?

<img src="https://albarji-labs-materials.s3-eu-west-1.amazonaws.com/exclamation.png" height="80" width="80" style="float: right;"/>

***
<font color=#2655ad>
Use a text editor to look at the contents of any of the files. These files follow a fixed-width format, which means all data columns are aligned and separated by whitespace. Fortunately, numpy includes the <a href=https://numpy.org/doc/stable/reference/generated/numpy.loadtxt.html>loadtxt</a> function, which can load this kind of file.
    
Also, there are no headers! This is because the interpretation is easy: the last column is the output class, while the remaining columns are input features.
</font>

***
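
As a minimal sketch of the hint above, the snippet below runs `loadtxt` on an in-memory text with made-up values instead of a real file. The names `toy_file`, `toy_data`, `toy_X` and `toy_y` are hypothetical and only serve the illustration; the column layout (features first, class last) is the one described in the hint.

```python
from io import StringIO

import numpy as np

# Hypothetical file contents: two feature columns followed by the class label.
toy_file = StringIO(
    "  1.14  -0.11   1.0\n"
    " -1.52   0.27  -1.0\n"
    "  0.32   1.45   1.0\n"
)

# By default, loadtxt splits each line on any run of whitespace.
toy_data = np.loadtxt(toy_file)

toy_X = toy_data[:, :-1]  # every column except the last one
toy_y = toy_data[:, -1]   # last column only, as a one-dimensional array

print(toy_X.shape, toy_y.shape)
```

Indexing with a single integer (`-1`) rather than a slice is what makes the targets come out as a one-dimensional array.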

<img src="https://albarji-labs-materials.s3-eu-west-1.amazonaws.com/question.png" height="80" width="80" style="float: right;"/>

***

<font color=#ad3e26>
Create a function <b>load_datafile</b> that receives the full path to a datafile and returns <b>X</b> and <b>y</b> numpy arrays with the input features and classes contained in that file. Make sure that the returned <b>y</b> is a numpy array of a single dimension.
</font>

***

In [None]:
####### INSERT YOUR CODE HERE

If implemented correctly, the following call should properly load the features and targets of the **banana** training dataset.

In [None]:
X_train_banana, y_train_banana = load_datafile("./data/banana.train")

In [None]:
X_train_banana

In [None]:
y_train_banana

Moving one step further, let's now create a function that loads the two datasets available for the same problem (`.train` and `.test`) and organizes them in a dictionary in the form

    {
        "train": (X_train, y_train),
        "test": (X_test, y_test)
    }
    
That is, the dictionary must have a key `train` containing a tuple or list with the matrix of input features `X` and the targets `y` for the training data. Similarly, a `test` key must also be present, containing features and targets for the test data.

<img src="https://albarji-labs-materials.s3-eu-west-1.amazonaws.com/question.png" height="80" width="80" style="float: right;"/>

***

<font color=#ad3e26>
    Create a function <b>load_dataset</b> that receives the path to a dataset (without the <i>.train</i> or <i>.test</i> extension) and returns a dictionary with the training and test data, in the format presented above.
</font>

***

<img src="https://albarji-labs-materials.s3-eu-west-1.amazonaws.com/exclamation.png" height="80" width="80" style="float: right;"/>

***
<font color=#2655ad>
    Make use of the <b>load_datafile</b> function you created above!
</font>

***

In [None]:
####### INSERT YOUR CODE HERE

Let's try your function by loading all the data for the banana problem:

In [None]:
data_banana = load_dataset("./data/banana")
data_banana

Now for the final loading step!

<img src="https://albarji-labs-materials.s3-eu-west-1.amazonaws.com/question.png" height="80" width="80" style="float: right;"/>

***

<font color=#ad3e26>
    Create a function <b>load_datasets</b> that receives no arguments and returns a dictionary indexed by the dataset names, containing all the data from those datasets. That is, it must return a dictionary in the form
    
<pre>
datasets = {
  "titanic" : {
    "train" : (X_train_titanic, y_train_titanic),
    "test" : (X_test_titanic, y_test_titanic)
  },
  "thyroid" : {
    "train" : (X_train_thyroid, y_train_thyroid),
    "test" : (X_test_thyroid, y_test_thyroid)
  },
  ...
}
</pre>
    
</font>

***

<img src="https://albarji-labs-materials.s3-eu-west-1.amazonaws.com/exclamation.png" height="80" width="80" style="float: right;"/>

***
<font color=#2655ad>
    To obtain the names of all the files containing data, you can use the <b>listdir</b> function presented at the beginning of this notebook. You can also make use of the <a href=https://www.w3schools.com/python/ref_string_split.asp>split</a> function to extract just the filename from a full file path.<br>
    Also, make use of the <b>load_dataset</b> function you created above!
</font>

***
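
One possible way to turn a directory listing into dataset names is sketched below. The list `toy_files` is a hypothetical stand-in for the output of `listdir`, and the sketch assumes that dataset names contain no dots other than the one before the extension.

```python
# Hypothetical directory listing; the real one would come from listdir('./data/').
toy_files = ["banana.train", "banana.test", "titanic.train", "titanic.test"]

# Keep the part before the first dot, and use a set to drop the duplicates
# that appear because every dataset has both a .train and a .test file.
toy_names = sorted({name.split(".")[0] for name in toy_files})

print(toy_names)  # ['banana', 'titanic']
```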

In [None]:
####### INSERT YOUR CODE HERE

Let's check that your function works!

In [None]:
datasets = load_datasets()
datasets

### Saving the data file

Since we will use these datasets in more notebooks after this one, we will save the organized object we have created as a `pickle` file. These files can hold complex Python structures, and are more efficient in terms of disk space than text files. We can create a pickle file for our data as follows:

In [None]:
import pickle as pkl

with open('datasets.pkl', 'wb') as file:
    pkl.dump(datasets, file)

With this, in further notebooks we will be able to load back exactly the same data, in exactly the same structure, by copying the file to the notebook folder and using the following code:

```
import pickle as pkl

with open('datasets.pkl', 'rb') as file:
    datasets = pkl.load(file)
```
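
To see the round trip in action without touching the real data, here is a small sketch that saves and reloads a toy object with the same nesting as our dictionary. The names `toy_datasets`, `toy_path` and `restored` are hypothetical, and a temporary directory is used so that nothing is left on disk.

```python
import os
import pickle as pkl
import tempfile

import numpy as np

# Toy object with the same nesting as the datasets dictionary.
toy_datasets = {
    "toy": {
        "train": (np.zeros((4, 2)), np.ones(4)),
        "test": (np.zeros((2, 2)), np.ones(2)),
    }
}

with tempfile.TemporaryDirectory() as tmp_dir:
    toy_path = os.path.join(tmp_dir, "toy_datasets.pkl")
    with open(toy_path, "wb") as file:
        pkl.dump(toy_datasets, file)
    with open(toy_path, "rb") as file:
        restored = pkl.load(file)

# The restored arrays match the originals element by element.
X_restored, y_restored = restored["toy"]["train"]
print(np.array_equal(X_restored, toy_datasets["toy"]["train"][0]))
print(np.array_equal(y_restored, toy_datasets["toy"]["train"][1]))
```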

## Base learners

In order to build an ensemble, we need to decide which individual learners we want to combine. Remember that ideally we would like those learners to have two properties, which most often pull in opposite directions:
* <i>Accuracy</i>: the more accurate they are, the more accurate their combination is expected to be.
* <i>Diversity</i>: the more diverse they are, the more likely it is that their combination surpasses them.

As a compromise between these two goals and because of some restrictions scikit-learn imposes, we will use the following base classifiers:
* <a href="https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression">Logistic Regression</a>
* <a href="https://scikit-learn.org/stable/modules/linear_model.html#stochastic-gradient-descent-sgd">Stochastic Gradient Descent</a>
* <a href="https://scikit-learn.org/stable/modules/naive_bayes.html">(Gaussian) Naïve Bayes</a>
* <a href="https://scikit-learn.org/stable/modules/tree.html">Decision tree</a>

These are collected in the following list, together with their short names:

In [None]:
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

base_learners = [
    ('lr', LogisticRegression()), 
    ('sgd', SGDClassifier()), 
    ('nb', GaussianNB()), 
    ('dt', DecisionTreeClassifier())
]

Time to fit them on the training data to obtain their reference accuracies.

<img src="https://albarji-labs-materials.s3-eu-west-1.amazonaws.com/question.png" height="80" width="80" style="float: right;"/>

***

<font color=#ad3e26>
Fit the four models above on all training sets and collect the accuracies on the test sets in a dataframe called <b><i>base_scores</i></b> (one row per dataset, one column per model). Use as column names the names of the base learners themselves.
    
</font>

***
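
The general pattern (fit, score, collect into a dataframe) can be sketched on synthetic data as follows. This is only an illustration under stated assumptions: the two datasets are generated with `make_classification` rather than loaded from the *data* folder, only two learners are used, and all `toy_` names are hypothetical.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

toy_learners = [
    ("nb", GaussianNB()),
    ("dt", DecisionTreeClassifier(random_state=0)),
]

toy_rows = {}
for seed in (0, 1):
    # Synthetic stand-in for one dataset with its train/test split.
    X_toy, y_toy = make_classification(n_samples=200, n_features=5, random_state=seed)
    X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, random_state=seed)
    # score() returns the mean accuracy of a classifier on the given data.
    toy_rows[f"synthetic_{seed}"] = {
        name: model.fit(X_tr, y_tr).score(X_te, y_te) for name, model in toy_learners
    }

# orient="index" turns the outer keys into rows and the inner keys into columns.
toy_frame = pd.DataFrame.from_dict(toy_rows, orient="index")
print(toy_frame)
```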

In [None]:
####### INSERT YOUR CODE HERE

<img src="https://albarji-labs-materials.s3-eu-west-1.amazonaws.com/question.png" height="80" width="80" style="float: right;"/>

***

<font color=#ad3e26>
Plot the accuracies you obtained. Which model wins on each dataset? Do you think these models are accurate and diverse?
</font>

***

<img src="https://albarji-labs-materials.s3-eu-west-1.amazonaws.com/exclamation.png" height="80" width="80" style="float: right;"/>

***
<font color=#2655ad>
    You can easily plot the contents of a pandas DataFrame using the <a href=https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.plot.html>plot</a> method of the DataFrame itself. Configure the <b>kind</b> parameter to choose an adequate kind of plot. Check also the rest of parameters to control the plot style.
</font>

***

In [None]:
####### INSERT YOUR CODE HERE

## Bagging and boosting

Perhaps the easiest way to form an ensemble is with the <b><a href="https://scikit-learn.org/stable/modules/ensemble.html#bagging-meta-estimator">bagging</a></b> technique. Scikit-learn allows us to build bags of whichever base estimator we want. Let us see if we can do better than with individual estimators:

<img src="https://albarji-labs-materials.s3-eu-west-1.amazonaws.com/question.png" height="80" width="80" style="float: right;"/>

***

<font color=#ad3e26>
Build bags of the four base models and fit them on all training sets, collecting the resulting accuracies on test as you did before, this time on a dataframe called <b><i>bagging_scores</i></b>. Use as column names the names of the base learners plus the suffix <b><i>'_bag'</i></b>, so as to make the difference clear.
</font>

***

<img src="https://albarji-labs-materials.s3-eu-west-1.amazonaws.com/exclamation.png" height="80" width="80" style="float: right;"/>

***
<font color=#2655ad>
Check the documentation of the <b><a href="https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.BaggingClassifier.html">BaggingClassifier</a></b> class carefully. Pay particular attention to the parameters <b><i>base_estimator</i></b> (use each base model! Note that this parameter is named <b><i>estimator</i></b> in scikit-learn 1.2 and later), <b><i>n_estimators</i></b> (use the <b>ESTIMS</b> variable provided below) and <b><i>n_jobs</i></b> (take advantage of parallelism!).
</font>

***
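
A minimal sketch of a single bag on synthetic data is shown below. It assumes a decision tree as base model and a small, arbitrary number of estimators; the base model is passed as the first positional argument because its keyword name differs between scikit-learn releases. All `toy` names are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for one of the datasets.
X_toy, y_toy = make_classification(n_samples=300, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, random_state=0)

# Ten trees, each fitted on a bootstrap sample of the training data,
# with the fitting work spread over two parallel jobs.
toy_bag = BaggingClassifier(
    DecisionTreeClassifier(),
    n_estimators=10,
    n_jobs=2,
    random_state=0,
)
toy_bag.fit(X_tr, y_tr)
toy_bag_accuracy = toy_bag.score(X_te, y_te)

print(len(toy_bag.estimators_))  # one fitted copy of the base model per estimator
print(toy_bag_accuracy)
```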

In [None]:
from sklearn.ensemble import BaggingClassifier
ESTIMS = 50

In [None]:
####### INSERT YOUR CODE HERE

Let's see if we have actually improved:

<img src="https://albarji-labs-materials.s3-eu-west-1.amazonaws.com/question.png" height="80" width="80" style="float: right;"/>

***

<font color=#ad3e26>
Concatenate <i>base_scores</i> and <i>bagging_scores</i> in another dataframe <b><i>all_scores</i></b> and plot the accuracies obtained. Which model wins on each dataset now? Did bagging help? Why/why not?
</font>

***
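
Joining two score tables side by side might look like the sketch below. The accuracy values are placeholders made up for the illustration, and the `toy_` names are hypothetical.

```python
import pandas as pd

# Placeholder accuracies, made up for illustration only.
toy_base = pd.DataFrame({"lr": [0.80, 0.60]}, index=["dataset_a", "dataset_b"])
toy_bagged = pd.DataFrame({"lr_bag": [0.82, 0.61]}, index=["dataset_a", "dataset_b"])

# axis=1 places the frames side by side, aligning rows on the shared index.
toy_all = pd.concat([toy_base, toy_bagged], axis=1)

print(toy_all.columns.tolist())  # ['lr', 'lr_bag']
```

Because rows are aligned on the index, this only works as intended if both frames use the dataset names as their index.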

In [None]:
####### INSERT YOUR CODE HERE

Another popular option to build ensembles is via <b><a href="https://scikit-learn.org/stable/modules/ensemble.html#adaboost">boosting</a></b>. Scikit-learn is more restrictive about which kinds of models can be boosted, but in this case all the models we selected are valid:

<img src="https://albarji-labs-materials.s3-eu-west-1.amazonaws.com/question.png" height="80" width="80" style="float: right;"/>

***

<font color=#ad3e26>
 Build boosted ensembles of the four base models and fit them on all training sets, collecting the resulting accuracies on test as you did before, this time on a dataframe called <b><i>boosting_scores</i></b>. Use as column names the names of the base learners plus the suffix <b><i>'_boo'</i></b>, so as to make the difference clear.
</font>

***

<img src="https://albarji-labs-materials.s3-eu-west-1.amazonaws.com/exclamation.png" height="80" width="80" style="float: right;"/>

***
<font color=#2655ad>
As with bagging, check the documentation of the <b><a href="https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.AdaBoostClassifier.html">AdaBoostClassifier</a></b> class carefully. You already know what <b><i>base_estimator</i></b> and <b><i>n_estimators</i></b> are for, but now you also need to specify <b><i>algorithm='SAMME'</i></b> (the default 'SAMME.R' will crash!). Can you explain why there is no <b><i>n_jobs</i></b> in this case?
</font>

***
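
A likely reason behind the warning about 'SAMME.R' is that this variant relies on class probability estimates, which not every base learner offers. The sketch below simply checks which of our four base models expose a `predict_proba` method with their default settings; the names `toy_learners` and `has_proba` are hypothetical.

```python
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

toy_learners = [
    ("lr", LogisticRegression()),
    ("sgd", SGDClassifier()),
    ("nb", GaussianNB()),
    ("dt", DecisionTreeClassifier()),
]

# hasattr is False when the model cannot produce class probabilities,
# as happens for SGDClassifier with its default hinge loss.
has_proba = {name: hasattr(model, "predict_proba") for name, model in toy_learners}

print(has_proba)
```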

In [None]:
from sklearn.ensemble import AdaBoostClassifier

In [None]:
####### INSERT YOUR CODE HERE

Time to check which option is best.

<img src="https://albarji-labs-materials.s3-eu-west-1.amazonaws.com/question.png" height="80" width="80" style="float: right;"/>

***

<font color=#ad3e26>
Append <i>boosting_scores</i> to <i>all_scores</i> and plot the accuracies obtained for both bagging and boosting. Which model wins on each dataset now? Did boosting help? Why/why not?
</font>

***

In [None]:
####### INSERT YOUR CODE HERE

## Voting and stacking

One limitation of bagging and boosting is that they can only combine <b><i>homogeneous</i></b> models. For example, we cannot combine a Logistic Regression with a Decision Tree in the same bagging/boosting ensemble. This can become problematic unless the models being combined are unstable (i.e., sensitive to slight changes in the data). 

Think about this: if the mixture is homogeneous, and the models being mixed are also stable, the combination will be nearly identical to the individual model! So why bother building a combination? We would be better off with the stable, individual model itself (that is, the way we started in this notebook).

Fortunately, there are ways out of this. The first one is to resort to other ensemble strategies that combine <b><i>heterogeneous models</i></b>. Scikit-learn has the following ones:
* <b><a href="https://scikit-learn.org/stable/modules/ensemble.html#voting-classifier">Voting</a></b>: let each model output its estimated label for some pattern, and combine what models say on average/majority (usually with equal weight each).
* <b><a href="https://scikit-learn.org/stable/modules/ensemble.html#stacked-generalization">Stacking</a></b>: let each model output its estimated label for some pattern, and build another model on top of this which takes those estimated labels and decides how to combine them best to approximate the actual labels. 
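
To make the voting idea concrete, the following sketch computes a majority vote by hand with numpy. The predicted labels are made up, and the shortcut used here assumes binary 0/1 labels and an odd number of models, so that ties cannot happen.

```python
import numpy as np

# Hypothetical predicted labels: one row per model, one column per pattern.
toy_predictions = np.array([
    [0, 1, 1, 0],
    [0, 1, 0, 0],
    [1, 1, 0, 0],
])

# With 0/1 labels, the majority label is 1 whenever more than half
# of the models vote 1 for that pattern.
n_models = toy_predictions.shape[0]
majority = (toy_predictions.sum(axis=0) > n_models / 2).astype(int)

print(majority)  # [0 1 0 0]
```

Stacking would replace the fixed majority rule with a second model trained to map those rows of predictions to the actual labels.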

<img src="https://albarji-labs-materials.s3-eu-west-1.amazonaws.com/question.png" height="80" width="80" style="float: right;"/>

***

<font color=#ad3e26>
Build voting and stacking ensembles of the 4 base models and fit them on all training sets, collecting the resulting accuracies on test as you did before, this time on a dataframe called <b><i>mixing_scores</i></b>. This dataframe will only have 2 columns (why?); call them <b><i>'vote'</i></b> and <b><i>'stack'</i></b>, respectively.
</font>

***

<img src="https://albarji-labs-materials.s3-eu-west-1.amazonaws.com/exclamation.png" height="80" width="80" style="float: right;"/>

***
<font color=#2655ad>
Inspect the documentation of <b><a href="https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.VotingClassifier.html">VotingClassifier</a></b> and <b><a href="https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.StackingClassifier.html">StackingClassifier</a></b> classes to use them properly.
</font>

***
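
As a starting point, here is a sketch of both classes on synthetic data. It assumes three of the base learners, hard (majority) voting, and a logistic regression as the model on top of the stack; these are choices made for the illustration, not the only valid ones, and all `toy` names are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for one of the datasets.
X_toy, y_toy = make_classification(n_samples=300, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, random_state=0)

# Both ensembles receive the learners as a list of (name, model) pairs.
toy_learners = [
    ("lr", LogisticRegression()),
    ("nb", GaussianNB()),
    ("dt", DecisionTreeClassifier(random_state=0)),
]

toy_vote = VotingClassifier(estimators=toy_learners, voting="hard")
toy_stack = StackingClassifier(
    estimators=toy_learners,
    final_estimator=LogisticRegression(),
)

toy_mixing = {
    "vote": toy_vote.fit(X_tr, y_tr).score(X_te, y_te),
    "stack": toy_stack.fit(X_tr, y_tr).score(X_te, y_te),
}
print(toy_mixing)
```

Note that the list of `(name, model)` pairs has the same shape as the `base_learners` list defined earlier in this notebook.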

In [None]:
from sklearn.ensemble import VotingClassifier, StackingClassifier

In [None]:
####### INSERT YOUR CODE HERE

<img src="https://albarji-labs-materials.s3-eu-west-1.amazonaws.com/question.png" height="80" width="80" style="float: right;"/>

***

<font color=#ad3e26>
     Append <i>mixing_scores</i> to <i>all_scores</i>. Since we have quite a lot of scores and it is difficult to compare them all, obtain the following:
     <ol>
         <li>Winner model for each dataset (i.e., the one with highest accuracy for each row)</li>
         <li>Model ranking across all datasets (i.e., sorted by decreasing average accuracy, taken over all datasets)</li>
     </ol>
     
Print the winner models and plot the ranking you get. What do you conclude?
</font>

***

<img src="https://albarji-labs-materials.s3-eu-west-1.amazonaws.com/exclamation.png" height="80" width="80" style="float: right;"/>

***
<font color=#2655ad>
Pandas can compute statistics across rows or columns of a DataFrame. For instance, check the <a href=https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.mean.html>mean</a>.<br>
    Also, for sorting the scores you can use the pandas <a href=https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sort_values.html>sort_values</a> method.
</font>

***
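
The two summaries can be sketched on a toy table as follows. The accuracies are placeholders made up for the illustration, and `idxmax` is one possible way (not mentioned in the hint) to find the column holding the highest value in each row.

```python
import pandas as pd

# Placeholder accuracies, made up for illustration only.
toy_all = pd.DataFrame(
    {"lr": [0.80, 0.60], "dt": [0.75, 0.90], "vote": [0.85, 0.70]},
    index=["dataset_a", "dataset_b"],
)

# Winner per dataset: name of the column with the highest value in each row.
winners = toy_all.idxmax(axis=1)

# Ranking: average each column over all datasets, then sort from best to worst.
ranking = toy_all.mean(axis=0).sort_values(ascending=False)

print(winners.tolist())        # ['vote', 'dt']
print(ranking.index.tolist())  # ['dt', 'vote', 'lr']
```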

In [None]:
####### INSERT YOUR CODE HERE