# Homework #5

## Ensemble learning

This Colaboratory notebook contains Homework #5 of the Machine Learning course, which is due **November 13, midnight (23:59 EEST)**. To complete the homework, download the notebook **(File -> Download .ipynb)** and submit it on the course webpage.


## Submission rules:

1. Please submit only the .ipynb file that you download from Colaboratory.
2. Run your homework exercises before submitting (output should be present; preferably restart the kernel and run all cells).
3. Do not change the task descriptions in red (even if there is a typo, mistake, etc.).
4. Please avoid unnecessarily long printouts.
5. Each task should be solved right under its question and not elsewhere.
6. Solutions to both regular and bonus exercises should be submitted in one .ipynb file.

Please steer clear of copying someone else's work. If you discuss assignments with anyone in the course, please mention their names here:

Pooh

## List of homework exercises:

1.   [Ex1](#scrollTo=ux5PBYkbwewj) - 3 points
2.   [Ex2](#scrollTo=Gezm0AO80ary) - 4 points
3.   [Ex3](#scrollTo=avCryKaDzJqn) - 3 points
4.   [Bonus 1](#scrollTo=jdZkblZW7bEp) - 2 points (based on quality of presentation)


In [1]:
import pandas as pd
import numpy as np

import warnings
warnings.filterwarnings('ignore')

# For plotting like a pro
!pip install -q plotnine
from plotnine import *

from sklearn.linear_model import LogisticRegression

In [2]:
def create_random_2c_data(D, N):
  """
  Generate two sets of D-dimensional points (N points each), one per class.
  The first set is sampled from a D-dimensional Gaussian with mean 0 and
  standard deviation 1 in every dimension; the second set is sampled from the
  same kind of distribution with mean 1 and standard deviation 1.
  """
  # Generating N points for the first class
  mu_vec1 = np.zeros(D) # vector of zeros: the mean of each dimension
  cov_mat1 = np.eye(D) # D x D identity matrix: unit variance, no correlation between dimensions
  class1_sample = np.random.multivariate_normal(mu_vec1, cov_mat1, N)

  # Same as above, but the means are shifted to 1
  mu_vec2 = np.ones(D) # vector of ones
  cov_mat2 = np.eye(D)
  class2_sample = np.random.multivariate_normal(mu_vec2, cov_mat2, N)

  # a lot of boring things....
  # gluing together two matrices generated above
  data = pd.DataFrame(np.concatenate((class1_sample, class2_sample)))

  # Create names for columns
  data.columns = [ 'x' + str(i) for i in (np.arange(D)+1)]

  # Create a class column
  data['class'] = np.concatenate((np.repeat(0, N), np.repeat(1, N)))

  # This is important for plotting and modelling
  data['class'] = data['class'].astype('category')

  return data
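
For context on the accuracies reported further down: both classes share the same identity covariance and contain the same number of points, so the best achievable accuracy on this kind of data can be computed in closed form. The sketch below assumes equal class priors; `standard_normal_cdf` and `bayes_accuracy` are hypothetical helper names that are not part of the assignment code.

```python
import math

def standard_normal_cdf(z):
    """Cumulative distribution function of N(0, 1)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def bayes_accuracy(n_dims):
    """Best achievable accuracy for N(0, I) vs N(1, I) with equal class priors."""
    # Distance between the class means (0, ..., 0) and (1, ..., 1)
    mean_distance = math.sqrt(n_dims)
    # The optimal boundary lies halfway between the means, so a point is
    # classified correctly unless it falls more than half the distance
    # towards the other class
    return standard_normal_cdf(mean_distance / 2.0)

for n_dims in (1, 2, 10):
    print(n_dims, round(bayes_accuracy(n_dims), 3))
```

For `D = 2` this comes out at roughly 76%, so validation accuracies in the 70 to 75% range are already close to what any classifier can reach on average with this data.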



---


## Homework exercise 1: implement and train a bagging classifier with 3 K-NN models as estimators (3 points)


<font color='red'> In this exercise you will need to use `classify_knn` function from the first practice session to train three different KNN models on three resamples of this dataset. </font> 


In [3]:
from sklearn.model_selection import train_test_split
from sklearn.utils import resample
np.random.seed(2342347823) # random seed, this number was random, no need to make conspiracies around it

D = 2 # two dimensions
N = 100 # points per class

whole_data = create_random_2c_data(D, N)

# Randomly splitting data into train (60%) and validation (40%)
train, val = train_test_split(whole_data, random_state = 111, test_size = 0.40) 

n_bootstraps = 3
np.random.seed(1111)

# creating resamples
resamples = [resample(train, n_samples = int(len(train)*0.8), replace=False).index.values for i in range(n_bootstraps)]

# first resample
train_resample1 = train.loc[resamples[0]]

# second resample
train_resample2 = train.loc[resamples[1]]

# third resample
train_resample3 = train.loc[resamples[2]]

<font color='red'> Here, I just convert pandas DataFrame into Numpy arrays that are easier to use list comprehension mechanisms on. </font> 

In [4]:
train1 = np.asarray(train_resample1[['x1','x2']])
labels1 = np.asarray(train_resample1[['class']]).reshape((train_resample1.shape[0]))

train2 = np.asarray(train_resample2[['x1','x2']])
labels2 = np.asarray(train_resample2[['class']]).reshape((train_resample2.shape[0]))

train3 = np.asarray(train_resample3[['x1','x2']])
labels3 = np.asarray(train_resample3[['class']]).reshape((train_resample3.shape[0]))

val_points = np.asarray(val[['x1','x2']])
val_labels = np.asarray(val[['class']]).reshape((val.shape[0]))

<font color='red'>  **(Homework exercise 1- a)** Copy and adapt `classify_knn` function from the first homework and practice session to operate on 2D points. **(1 point)**</font> 

In [5]:
def dist(point1, point2): # function dist is also from the first practice session
  # sum of squared coordinate-wise differences under sqrt
  return(np.sqrt(np.sum((point2 - point1)**2)))

def classify_knn(val_point, k, train, labels):
  ##### YOUR CODE STARTS #####
  all_distances = [dist(val_point, point) for point in train]
  nearest_neighbours = np.argsort(all_distances)
  predicted_classes = [labels[index] for index in nearest_neighbours[:k]]
  prediction = max(predicted_classes, key=predicted_classes.count)
  ##### YOUR CODE ENDS ##### 
  
  return prediction

<font color='red'> Test that the function was adapted correctly by running the following example </font> 

In [6]:
val_point = val_points[1]
print(f'predicted class of the second point is {classify_knn(val_point, 5,  train1, labels1)}, while the true class is {val_labels[1]}')

predicted class of the second point is 0, while the true class is 0
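
As a cross-check of the distance-and-vote logic above, here is a dependency-free sketch of the same idea on a toy 2D data set. The function name `knn_predict`, the points, and the labels are illustrative assumptions and do not come from the assignment data.

```python
import math
from collections import Counter

def knn_predict(query, k, points, labels):
    """Return the majority label among the k points closest to `query`."""
    # Sort training indices by Euclidean distance to the query point
    order = sorted(range(len(points)), key=lambda i: math.dist(query, points[i]))
    # Count the labels of the k nearest neighbours and take the most frequent one
    votes = Counter(labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]

# Toy data (hypothetical): class 0 clusters near (0, 0), class 1 near (1, 1)
toy_points = [(0.0, 0.1), (0.2, -0.1), (-0.1, 0.0), (1.0, 1.1), (0.9, 0.8), (1.2, 1.0)]
toy_labels = [0, 0, 0, 1, 1, 1]

pred_near_origin = knn_predict((0.1, 0.0), 3, toy_points, toy_labels)
pred_near_ones = knn_predict((1.0, 1.0), 3, toy_points, toy_labels)
print(pred_near_origin, pred_near_ones)
```

With an odd `k` and two classes, as in this exercise, the vote can never end in a tie.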


<font color='red'> **(Homework exercise 1- b)** Classify each point from the validation set using `classify_knn` function. Use different resamples and list comprehension [(do something with point) for point in points]. Fix `k` at 5. **(1 point)**</font> 


In [7]:
k = 5

# Use three K-NN models that work on three different resamples

##### YOUR CODE STARTS #####
val['knn1'] = [classify_knn(val_point, k, train1, labels1) for val_point in val_points]
val['knn2'] = [classify_knn(val_point, k, train2, labels2) for val_point in val_points]
val['knn3'] = [classify_knn(val_point, k, train3, labels3) for val_point in val_points]
##### YOUR CODE ENDS ##### 

<font color='red'> **(Homework exercise 1- c)** Aggregate individual predictions using the majority vote approach **(0.5 points)**</font> 

In [8]:
##### YOUR CODE STARTS #####
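# mode(axis=1) returns a DataFrame; three binary votes always have a unique mode, so take column 0
val['knn_bagging'] = val[['knn1', 'knn2', 'knn3']].mode(axis=1)[0]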
##### YOUR CODE ENDS ##### 

print(f"Accuracy of hand made bagged ensemble with 3 KNNs is {np.sum(val['knn_bagging'] == val['class'])/len(val[['class']])*100}%")

Accuracy of hand made bagged ensemble with 3 KNNs is 75.0%
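
The row-wise majority vote can also be sketched without pandas. The prediction lists below are hypothetical and `majority_vote` is an illustrative helper, not a function from the practice session.

```python
from collections import Counter

def majority_vote(votes):
    """Return the most frequent label in `votes` (ties go to the label seen first)."""
    return Counter(votes).most_common(1)[0][0]

# Hypothetical predictions of three models for four validation points
knn1_preds = [0, 1, 1, 0]
knn2_preds = [0, 1, 0, 1]
knn3_preds = [1, 1, 0, 0]

# zip(...) walks over the points, giving one (knn1, knn2, knn3) triple per point
ensemble_preds = [majority_vote(triple) for triple in zip(knn1_preds, knn2_preds, knn3_preds)]
print(ensemble_preds)
```

An odd number of voters on a two-class problem guarantees a unique winner for every point, which is why three models is a convenient ensemble size here.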


<font color='red'> **(Homework exercise 1- d)** Use sklearn `BaggingClassifier` to implement analogous model that uses KNeighborsClassifier as an estimator (with k = 5). Don't forget to use a random state for reproducibility.

Assess its performance on the same validation set and display it. **(0.5 points)**</font> 


In [9]:
from sklearn.ensemble import BaggingClassifier
from sklearn.neighbors import KNeighborsClassifier

##### YOUR CODE STARTS #####
# The base model is passed positionally because newer scikit-learn releases
# renamed the keyword from base_estimator to estimator
knn_bagger = BaggingClassifier(KNeighborsClassifier(n_neighbors=k),
                               max_samples=0.8, n_estimators=3,
                               random_state=0).fit(train[['x1', 'x2']], train['class'])
##### YOUR CODE ENDS ##### 
print(f"Accuracy of sklearn bagging with {3} KNNs {knn_bagger.score(val[['x1', 'x2']], val['class'])*100}%")

Accuracy of sklearn bagging with 3 KNNs 70.0%
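
One difference between the two ensembles is worth keeping in mind when comparing their accuracies. The hand-made version draws its 80% resamples with `replace=False`, while `BaggingClassifier` draws with replacement by default (`bootstrap=True`). The sketch below uses only the standard library to illustrate the two sampling schemes; the seed is arbitrary and the index list stands in for the 120 training rows.

```python
import random

rng = random.Random(0)          # fixed seed so the sketch is reproducible
n_train = 120                   # size of the 60% training split of 200 points
n_draw = int(n_train * 0.8)     # 96 draws per base model
indices = list(range(n_train))

# Subsampling: draw without replacement, every index appears at most once
subsample = rng.sample(indices, n_draw)

# Bootstrap: draw with replacement, duplicates are expected
bootstrap = rng.choices(indices, k=n_draw)

print("unique rows in subsample:", len(set(subsample)))
print("unique rows in bootstrap:", len(set(bootstrap)))
```

A bootstrap sample of the same size contains fewer distinct rows, so each base model sees less of the training data. This is one possible contribution to the gap between the two accuracies; the random seeds differ as well.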


## Homework exercise 2: eXtreme Gradient Boosting (XGBoost) (4 points)

<font color='red'> Let's finally build for ourselves a new shiny XGBoost model, the most popular algorithm for Kaggle competitions. </font>

<font color='red'> First, we need to load data (we shall use MNIST data again). </font>

In [10]:
from keras.datasets import mnist
(images, labels) = mnist.load_data()[0]

# reshape into a matrix format
images = images.reshape(-1, 28*28)

# use fewer images for faster training
train_images = images[0:2000]
train_labels = labels[0:2000]

test_images = images[2000:3000]
test_labels = labels[2000:3000]

Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/mnist.npz


<font color='red'> **(Homework exercise 2- a)** Use the tutorial page (https://xgboost.readthedocs.io/en/latest/python/python_intro.html and https://www.kaggle.com/anktplwl91/mnist-xgboost) to fill in the gaps in the following code and train the XGBoost model. **(1 point)** </font>

In [11]:
import xgboost as xgb

##### YOUR CODE STARTS #####

# XGBoost wants data to be wrapped into its own DMatrix format
dtrain = xgb.DMatrix(train_images, label=train_labels)
dtest = xgb.DMatrix(test_images, label=test_labels)

# most meaningful parameters
param_list = [("objective", "multi:softmax"), ("eval_metric", "merror"), ("num_class", 10)]

# Number of trees
n_rounds = 600

# patience: stop if the eval metric has not improved for 50 rounds
# (only takes effect when passed to xgb.train as early_stopping_rounds)
early_stopping = 50

# train for training and test for ... validation!    
eval_list = [(dtrain, 'train'), (dtest, 'eval')]

# 1,2,3.. go!

%time bst = xgb.train(param_list, dtrain, n_rounds, eval_list, verbose_eval=False)
##### YOUR CODE ENDS #####

CPU times: user 4min 21s, sys: 488 ms, total: 4min 21s
Wall time: 2min 26s
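
The `early_stopping` patience defined in the cell above only has an effect when it is handed to `xgb.train` through its `early_stopping_rounds` argument. The rule itself is simple enough to sketch in plain Python. This is an illustration of the idea rather than XGBoost's internal implementation, and the error curve is made up.

```python
def best_round_with_patience(eval_errors, patience):
    """Return (stop_round, best_round) for a sequence of per-round validation errors."""
    best_error = float("inf")
    best_round = -1
    for round_idx, error in enumerate(eval_errors):
        if error < best_error:
            best_error, best_round = error, round_idx
        elif round_idx - best_round >= patience:
            # No improvement for `patience` consecutive rounds: stop here
            return round_idx, best_round
    # Ran out of rounds without triggering the stop
    return len(eval_errors) - 1, best_round

# Hypothetical validation error curve (not taken from the run above)
errors = [0.30, 0.22, 0.18, 0.17, 0.17, 0.18, 0.19, 0.18, 0.20]
stop_round, best_round = best_round_with_patience(errors, patience=3)
print(stop_round, best_round)
```

Note that a data set used to decide when to stop is no longer an unbiased test set, so the final score should be computed on data that was not passed in the evaluation list.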


<font color='red'> **(Homework exercise 2- b)** Use the same tutorial page (https://xgboost.readthedocs.io/en/latest/python/python_intro.html) to find out how to evaluate the model **(1 point)** </font>

In [12]:
##### YOUR CODE STARTS #####
from sklearn.metrics import accuracy_score
test_images_2 = images[3000:4000]
test_labels_2 = labels[3000:4000]

ypred = bst.predict(xgb.DMatrix(test_images_2))
print("XGBoost accuracy:", accuracy_score(test_labels_2, ypred))  # signature is (y_true, y_pred)
##### YOUR CODE ENDS #####

XGBoost accuracy: 0.911
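
The `merror` metric chosen for training is the fraction of misclassified samples, so it is simply one minus the accuracy printed above. A small sketch with made-up labels (the helper names are illustrative):

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

def merror(y_true, y_pred):
    """Multiclass error rate: fraction of wrong predictions."""
    return 1.0 - accuracy(y_true, y_pred)

# Hypothetical labels; multi:softmax returns class ids as floats, e.g. 3.0
y_true = [3, 1, 4, 1, 5]
y_pred = [3.0, 1.0, 4.0, 7.0, 5.0]
print(accuracy(y_true, y_pred), merror(y_true, y_pred))
```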


<font color='red'> Are you impressed with XGBoost performance? </font>

<font color='red'> **(Homework exercise 2- c)** Train Adaptive Boosting, Gradient Boosting and a simple KNN model from sklearn (KNeighborsClassifier) on the same training data and evaluate on the same test data. For each model use the default hyperparameters (e.g. `n_estimators` or `n_neighbors`). If you do not want to use default parameters, you can use `cross_val_score` to pick the best values for the hyperparameters using training data. Compare the performance of these three models and XGBoost and draw conclusions in a separate text cell.  **(2 points)** </font>

In [13]:
%%time
# AdaBoostClassifier (the %%time cell magic has to be the first line of the cell)
##### YOUR CODE STARTS #####
from sklearn.ensemble import AdaBoostClassifier
abc = AdaBoostClassifier()
abc.fit(train_images, train_labels)
print("Accuracy:", abc.score(test_images, test_labels))
##### YOUR CODE ENDS #####

Accuracy: 0.524
CPU times: user 2.85 s, sys: 19.9 ms, total: 2.87 s
Wall time: 2.92 s


In [14]:
%%time
# GradientBoostingClassifier (the %%time cell magic has to be the first line of the cell)
# might take considerable time if trained with default number of n_estimators
##### YOUR CODE STARTS #####
from sklearn.ensemble import GradientBoostingClassifier
gbc = GradientBoostingClassifier()
gbc.fit(train_images, train_labels)
print("Accuracy:", gbc.score(test_images, test_labels))
##### YOUR CODE ENDS #####

Accuracy: 0.891
CPU times: user 1min 42s, sys: 94.7 ms, total: 1min 42s
Wall time: 1min 42s


In [15]:
%%time
# KNeighborsClassifier (the %%time cell magic has to be the first line of the cell)
##### YOUR CODE STARTS #####
from sklearn.neighbors import KNeighborsClassifier
neigh = KNeighborsClassifier()
neigh.fit(train_images, train_labels)
print("Accuracy:", neigh.score(test_images, test_labels))
##### YOUR CODE ENDS #####

Accuracy: 0.912
CPU times: user 303 ms, sys: 87.6 ms, total: 391 ms
Wall time: 207 ms


<font color='red'> How do these models compare with each other and to XGBoost? Can you try to elaborate on this difference? </font>

In [16]:
# Write your comment here:

# XGBoost and GradientBoosting add trees sequentially (one tree per class in every boosting round),
# so they take much longer to train.
# AdaBoost received a rather low accuracy; the remaining models achieved very similar accuracies.
# Changing the AdaBoost base estimator to SVC(probability=True, kernel='linear') gave an accuracy of 0.911,
# so its low accuracy is most likely due to the weak default base estimator (a depth-1 decision tree).
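
To give some intuition for why the base estimator matters so much for AdaBoost, the sketch below computes the weight that one weak learner receives in the discrete multi-class variant known as SAMME. This is shown for illustration only: the library's default variant may differ, and the error values are hypothetical rather than measured on MNIST.

```python
import math

def samme_estimator_weight(weighted_error, n_classes):
    """Weight of one weak learner in multi-class AdaBoost (SAMME, learning rate 1)."""
    return math.log((1.0 - weighted_error) / weighted_error) + math.log(n_classes - 1.0)

n_classes = 10
chance_error = 1.0 - 1.0 / n_classes  # error of random guessing among 10 digits

# A learner at chance level gets weight 0; better-than-chance learners get positive weight
weight_at_chance = samme_estimator_weight(chance_error, n_classes)
weight_stump_like = samme_estimator_weight(0.75, n_classes)  # hypothetical weak stump
weight_strong = samme_estimator_weight(0.10, n_classes)      # hypothetical strong learner
print(weight_at_chance, weight_stump_like, weight_strong)
```

A depth-1 tree has only two leaves, so a single stump can separate at most two groups of digits. Each round therefore contributes little, and the default number of rounds may be too small to compensate.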

## Homework exercise 3: implement blending approach (3 points)
<font color='red'> In this exercise you will practice using blending approach to meta-learning. </font>

<font color='red'> **(Homework exercise 3- a)** to implement blending we first need to create a separate validation set that would be independent from training and test data. Below, use images from 0 to 1500 as training data, images from 1500 to 2000 as validation and from 2000 to 3000 as a test set. **(0.5 points)** </font>

In [17]:
from keras.datasets import mnist
(images, labels) = mnist.load_data()[0]

# reshape into a matrix format
images = images.reshape(-1, 28*28)

##### YOUR CODE STARTS #####
train_images = images[0:1500]
train_labels = labels[0:1500]

val_images = images[1500:2000]
val_labels = labels[1500:2000]

test_images = images[2000:3000]
test_labels = labels[2000:3000]
##### YOUR CODE ENDS #####

<font color='red'> **(Homework exercise 3- b)** Train three models (decision tree, k nearest neighbors classifier, and the logistic regression) with default parameters on the train data. **(0.5 points)** </font>

In [18]:
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier

##### YOUR CODE STARTS #####
model1 = LogisticRegression()
model2 = DecisionTreeClassifier()
model3 = KNeighborsClassifier()
##### YOUR CODE ENDS #####

np.random.seed(1111) 
##### YOUR CODE STARTS #####
model1.fit(train_images, train_labels)
print("LogisticRegression accuracy:", model1.score(val_images, val_labels))
##### YOUR CODE ENDS #####

np.random.seed(1111) 
##### YOUR CODE STARTS #####
model2.fit(train_images, train_labels)
print("DecisionTreeClassifier accuracy:", model2.score(val_images, val_labels))
##### YOUR CODE ENDS #####

np.random.seed(1111) 
##### YOUR CODE STARTS #####
model3.fit(train_images, train_labels)
print("KNeighborsClassifier accuracy:", model3.score(val_images, val_labels))
##### YOUR CODE ENDS #####

LogisticRegression accuracy: 0.87
DecisionTreeClassifier accuracy: 0.726
KNeighborsClassifier accuracy: 0.886


<font color='red'> **(Homework exercise 3- c)** Create a training set for the meta-learner by concatenating the predictions made by individual models on validation images. Hint: use function `np.concatenate` and `predict_proba` as we did for stacking. **(0.5 points)** </font>

In [19]:
##### YOUR CODE STARTS #####
train_blending =  np.concatenate([model1.predict_proba(val_images),
                                 model2.predict_proba(val_images), 
                                 model3.predict_proba(val_images)], 
                                axis = 1)
##### YOUR CODE ENDS #####

train_blending_labels = val_labels
train_blending.shape # if all was done correctly this shape should be (500, 30)

(500, 30)
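
The shape check above follows from simple bookkeeping: three models times ten class probabilities gives 30 meta-features per image. The sketch below mimics the column-wise concatenation with plain lists; `concat_columns` is a hypothetical stand-in for `np.concatenate(..., axis=1)` and the probabilities are made up.

```python
def concat_columns(*matrices):
    """Join row-aligned matrices side by side (the role of np.concatenate(..., axis=1))."""
    n_rows = len(matrices[0])
    assert all(len(m) == n_rows for m in matrices), "all blocks need the same number of rows"
    return [sum((list(m[i]) for m in matrices), []) for i in range(n_rows)]

# Hypothetical class probabilities from three models, 2 samples x 3 classes each
proba_a = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
proba_b = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
proba_c = [[0.6, 0.4, 0.0], [0.2, 0.6, 0.2]]

meta_features = concat_columns(proba_a, proba_b, proba_c)
print(len(meta_features), len(meta_features[0]))
```

The expected 30 columns rely on all ten digits being present in the 1500 training images, because `predict_proba` yields one column per class seen during fitting.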

<font color='red'> **(Homework exercise 3- d)** Create a test set for the meta-learner by concatenating the predictions made by each model on test images. Use the same function as in the cell above. **(0.5 points)** </font>

In [20]:
##### YOUR CODE STARTS #####
test_blending = np.concatenate([model1.predict_proba(test_images),
                                 model2.predict_proba(test_images), 
                                 model3.predict_proba(test_images)], 
                                axis = 1)
##### YOUR CODE ENDS #####

test_blending.shape # if all was done correctly this shape should be (1000, 30)

(1000, 30)

<font color='red'> **(Homework exercise 3- e)** Use a new model (SVM) as a meta-learner and train it on the `train_blending` data. **(0.5 points)** </font>

In [21]:
from sklearn.svm import SVC
np.random.seed(1111) 

##### YOUR CODE STARTS #####
blending_model = SVC()
blending_model.fit(train_blending, train_blending_labels)
##### YOUR CODE ENDS #####

SVC()

<font color='red'> **(Homework exercise 3- f)** Evaluate the performance of the blending ensemble on the test set and comment on the difference between blending and stacking (that we tried in the practice session). Which one would you prefer and why?  **(0.5 points)** </font>

In [22]:
##### YOUR CODE STARTS #####
print(f"Blending ensemble accuracy {blending_model.score(test_blending, test_labels)*100}%")
##### YOUR CODE ENDS #####

Blending ensemble accuracy 89.8%


In [23]:
# What is your take on the difference between blending and stacking (from practice session)?
# Which one would you prefer and why?
# Comment here:

# Stacking outperformed blending in this example. 
# Blending trains the meta-learner on base-model predictions for a separate hold-out validation set. 
# Stacking does the same with predictions obtained on folds of the training set. 
# The two approaches are very similar, but I think I would initially prefer blending, 
# because blending trains the meta-learner on only a small portion of the data and should 
# therefore be faster than stacking while still getting similarly high accuracy.
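
The cost difference mentioned in the comment can be made concrete by counting rows and model fits. The sketch below assumes that stacking uses plain K-fold out-of-fold predictions, which may differ in detail from the practice session. The helper names and the choice of 5 folds are illustrative.

```python
def blending_split(n_samples, n_holdout):
    """Base models train on the first part; the meta-learner trains on the holdout."""
    base_idx = list(range(n_samples - n_holdout))
    meta_idx = list(range(n_samples - n_holdout, n_samples))
    return base_idx, meta_idx

def stacking_folds(n_samples, n_folds):
    """Yield (fit_idx, predict_idx) pairs; each sample is predicted exactly once."""
    fold_size = n_samples // n_folds
    for fold in range(n_folds):
        start = fold * fold_size
        stop = n_samples if fold == n_folds - 1 else start + fold_size
        predict_idx = list(range(start, stop))
        fit_idx = [i for i in range(n_samples) if i < start or i >= stop]
        yield fit_idx, predict_idx

base_idx, meta_idx = blending_split(2000, 500)   # same sizes as in exercise 3
folds = list(stacking_folds(2000, 5))            # 5 folds is an arbitrary choice

meta_rows_blending = len(meta_idx)
meta_rows_stacking = sum(len(predict_idx) for _, predict_idx in folds)
base_fits_blending = 1
base_fits_stacking = len(folds)
print(meta_rows_blending, meta_rows_stacking, base_fits_stacking)
```

Under these assumptions blending fits each base model once and gives the meta-learner 500 rows, while 5-fold stacking fits each base model five times and gives the meta-learner all 2000 rows. That is the usual trade-off between speed and the amount of data the meta-learner sees.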

# Bonus exercises
*(NB, these are optional exercises!)*
 

## Bonus exercise 1 (2 bonus points):
<font color='red'> We have seen that in general increasing the number of estimators in an ensemble leads to better performance. In this exercise, you will experiment with the number of estimators in different ensemblers to explore **convergence behavior** and **overfitting**.  
You will compare the performance of **bagging of decision trees, random forests, extreme RF, boosting with decision trees and stacking different decision trees.** 
</font>
<font color='red'>
* Use MNIST dataset
* train different ensembles with various number of decision trees (ranging from 1 to 2000. i.e. choose a small max_depth for speed) 
* plot the **classification error** with the number of decision trees in each ensembler in the same plot.
* Compare the convergence behavior to other ensembles (e.g. RFs as base classifiers)
* Explain the behaviour that you observe.
</font>

<font color='red'>
As usual, technical depth and good presentation are the key to get bonus points.
</font>
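
As background for the convergence question, the following sketch computes how the error of a majority vote shrinks with the ensemble size under an idealised assumption: binary voters that err independently, each being right 60% of the time. Real trees trained on the same data are correlated, so this is an optimistic bound and not a prediction for MNIST. `majority_accuracy` is a hypothetical helper name.

```python
from math import comb

def majority_accuracy(n_estimators, p_correct):
    """P(majority of n independent binary voters is right); n must be odd."""
    needed = n_estimators // 2 + 1
    return sum(
        comb(n_estimators, k) * p_correct**k * (1 - p_correct) ** (n_estimators - k)
        for k in range(needed, n_estimators + 1)
    )

# Idealised assumption: voters err independently, each being right 60% of the time
sizes = [1, 5, 25, 101]
errors = [1 - majority_accuracy(n, 0.6) for n in sizes]
for n, err in zip(sizes, errors):
    print(n, round(err, 4))
```

In this idealised setting the error keeps falling towards zero as voters are added. In practice the curve typically flattens once the correlation between the base models dominates.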


In [24]:
##### YOUR CODE STARTS #####

##### YOUR CODE ENDS ##### (please do not delete this line)

# Comments (optional feedback to the course instructors)
Please leave your comments regarding the practice session here, for example by answering the following questions: 
* what would you suggest to add or remove?
* anything else you would like to tell us

In [25]:
# Add your comments here:
