In [1]:
###### Config #####
import sys, os, platform
if os.path.isdir("ds-assets"):
  !cd ds-assets && git pull
else:
  !git clone https://github.com/lutzhamel/ds-assets.git
colab = 'google.colab' in sys.modules
system = platform.system() # "Windows", "Linux", "Darwin"
home = "ds-assets/assets/"
sys.path.append(home)

Already up to date.


In [2]:
# modules
import pandas as pd
import numpy as np
import dsutils
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV


# Ensemble Techniques

We start by asking the following questions:

> What is better than one classifier? Answer: Two classifiers.

> What is better than two classifiers? Answer: Three classifiers.

> What is better than three classifiers? Answer: As many classifiers as you can computationally afford!

This gives rise to the notion of **ensemble techniques**, which combine multiple classifiers to form a single meta-classifier.  **The idea is that each individual classifier works on a different part of the domain and then contributes to the overall solution**.

Even though we started the discussion with classifiers, this also extends to regressors. To show that the techniques discussed below apply to both classes of models we refer to the machine learning models as "learners".  In particular, we call them "weak learners": the individual models need not be highly accurate or carefully tuned, so more often than not the default parameters provided by the library work fine.  It has been demonstrated that ensemble techniques work even with extremely weak learners such as decision trees limited to a depth of one.

Two of the more popular approaches to ensemble techniques are **bagging** and **boosting**.

# Bagging

  In bagging the learners act in **parallel** as is demonstrated by the following figure.

  <center>

  <!-- ![figure](https://miro.medium.com/v2/resize:fit:1050/1*a6hnuJ8WM37mLimHfMORmQ.png) -->

<img src="https://miro.medium.com/v2/resize:fit:1050/1*a6hnuJ8WM37mLimHfMORmQ.png"  height="300" width="625">

  </center>

  [source](https://medium.com/@brijesh_soni/boost-your-machine-learning-models-with-bagging-a-powerful-ensemble-learning-technique-692bfc4d1a51)

Notice that in steps 1-3 each learner is trained on a different **bootstrap sample** of the original dataset (a sample of the rows drawn with replacement).  Bootstrap samples expose the variability of a given training set.  With this in mind we see that each learner is trained on a slightly different dataset, exposing different aspects of the original data.  Once the learners have been trained, each makes a prediction (step 4).  These predictions are then aggregated in step 5.
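To make the idea of a bootstrap sample concrete, here is a minimal sketch using only the Python standard library. The row indices stand in for a dataset, and the helper name `bootstrap_sample` is made up for this illustration. For a large dataset a bootstrap sample contains on average about $1 - 1/e \approx 63\%$ of the distinct original rows; the remaining draws are repeats.

```python
import random

def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement from rows."""
    return [rng.choice(rows) for _ in rows]

rng = random.Random(0)            # fixed seed so the sketch is repeatable
rows = list(range(1000))          # stand-in for the row indices of a dataset

sample = bootstrap_sample(rows, rng)
unique_fraction = len(set(sample)) / len(rows)

print(len(sample))                # same size as the original dataset
print(round(unique_fraction, 2))  # fraction of distinct original rows in the sample
```

Each learner in a bagging ensemble would receive its own sample drawn this way.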

For classification the aggregation is usually some form of **voting** and for regression problems it is common to take the **mean** of the various predictions.
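The two aggregation schemes can be sketched in a few lines. The predictions below are made-up values for illustration, and `majority_vote` is a hypothetical helper. Note that scikit-learn's random forest classifier averages the predicted class probabilities of its trees rather than counting hard votes, so this is a simplified picture.

```python
from collections import Counter
from statistics import mean

def majority_vote(labels):
    """Return the label predicted by the largest number of learners."""
    return Counter(labels).most_common(1)[0][0]

# made-up predictions of five classifiers for a single document
class_predictions = ["space", "politics", "space", "space", "politics"]
# made-up predictions of three regressors for a single observation
value_predictions = [2.0, 2.5, 3.0]

print(majority_vote(class_predictions))  # space
print(mean(value_predictions))           # 2.5
```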



## Random Forests

Perhaps the best-known machine learning model based on bagging is the Random Forest, where each learner is a classification tree or a regression tree depending on whether we are looking at a classification or regression problem.

The interesting thing about random forests is that they not only bootstrap sample the data rows but they also perform something called **feature bagging** which means each tree is trained on a **random sample of features** instead of the entire feature set. This helps to improve the diversity of the trees in a similar way as creating the bootstrap samples of the data rows.
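A minimal sketch of feature bagging, assuming the common choice of sampling $\sqrt{\rm n\_features}$ features. The helper `random_feature_subset` is hypothetical. In scikit-learn's `RandomForestClassifier` the size of the subset is controlled by the `max_features` parameter (default `"sqrt"`), and the random subset is drawn anew at each split of a tree rather than once per tree.

```python
import math
import random

def random_feature_subset(n_features, rng):
    """Pick about sqrt(n_features) distinct feature indices at random."""
    k = max(1, int(math.sqrt(n_features)))
    return sorted(rng.sample(range(n_features), k))

rng = random.Random(0)
for draw in range(3):
    # each draw sees a different random subset of the 64 features
    print(draw, random_feature_subset(64, rng))
```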

Because of feature bagging **random forests tend to work really well with high-dimensional problems** such as text classification.


## Text Classification with Random Forests

From our work before we know that text classification tends to be very high-dimensional due to the vector model.

In [3]:
# get the newsgroup database
#newsgroups = pd.read_csv(home+"newsgroups.csv")
newsgroups = pd.read_csv(home+"newsgroups-noheaders.csv")
newsgroups.head(n=10)

Unnamed: 0,text,label
0,\nIn billions of dollars (%GNP):\nyear GNP ...,space
1,ajteel@dendrite.cs.Colorado.EDU (A.J. Teel) w...,space
2,\nMy opinion is this: In a society whose econ...,space
3,"Ahhh, remember the days of Yesterday? When we...",space
4,"\n""...a la Chrysler""?? Okay kids, to the near...",space
5,"\n As for advertising -- sure, why not? A N...",politics
6,"\n What, pray tell, does this mean? Just who ...",space
7,\nWhere does the shadow come from? There's no...,politics
8,^^^^^^^^^...,politics
9,"#Yet, when a law was proposed for Virginia tha...",space


In [4]:
# construct the docterm matrix
docterm = dsutils.docterm_matrix(newsgroups['text'], 
                                min_df=2, 
                                stem=True, 
                                stop_words='english', 
                                token_pattern="[a-zA-Z]+")
docterm

Unnamed: 0,aa,abandon,abbey,abc,abil,abl,aboard,abolish,abort,abroad,...,yugoslavia,yup,z,zealand,zenit,zero,zeta,zip,zone,zoo
doc0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
doc1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
doc2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
doc3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
doc4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
doc1033,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
doc1034,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
doc1035,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
doc1036,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [5]:
# set up train and test sets
X_train, X_test, y_train, y_test = \
  train_test_split(docterm,   # as X
                   newsgroups['label'],  # as y
                   train_size=0.8,
                   test_size=0.2,
                   random_state=2)

The tricky part with random forests is figuring out how many learners/estimators to incorporate into the meta-model.  Here is a simple rule of thumb:

$$
{\rm n\_estimators} = \frac{2*{\rm n\_features}}{\sqrt{\rm n\_features}}
$$

In our text classification case we have ${\rm n\_features}\approx 6000$ therefore,

$$
{\rm n\_estimators} = \frac{2*6000}{\sqrt{6000}} \approx 155 \approx 200
$$

Given that this is just a rule of thumb, it is a good idea to round up to the next round number.
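The rule of thumb is easy to compute; note that it simplifies to $2\sqrt{\rm n\_features}$. The rounding step below (round up to one significant digit) is an assumption about what "round number" means here, and both function names are made up for this sketch.

```python
import math

def rule_of_thumb_estimators(n_features):
    """Rule of thumb from the text: 2 * n_features / sqrt(n_features)."""
    return 2 * n_features / math.sqrt(n_features)

def round_up_one_significant_digit(value):
    """Round up to the next number with a single leading digit, e.g. 155 -> 200."""
    magnitude = 10 ** math.floor(math.log10(value))
    return math.ceil(value / magnitude) * magnitude

raw = rule_of_thumb_estimators(6000)
print(round(raw, 1))                        # 154.9
print(round_up_one_significant_digit(raw))  # 200
```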

In [6]:
# random forest model
model = RandomForestClassifier(
    n_estimators = 200,    # see rule of thumb above
    random_state = 0,
)

In [7]:
model.fit(X_train, y_train)

In [8]:
dsutils.acc_score(model, X_test, y_test, as_string=True)

'Accuracy: 0.92 (0.89, 0.96)'

In [9]:
dsutils.confusion_matrix(model, X_test, y_test)

Unnamed: 0,politics,space
politics,97,12
space,4,95


**Observation**: The random forest performs significantly better than the decision tree on this data set.  Consider the following results,

**decision tree**: Accuracy: 0.74 (0.68,0.80)

**random forest**: Accuracy: 0.92 (0.89, 0.96)

**naive bayes**: Accuracy: 0.96 (0.93,0.98)

Given these results we can see that the performance difference between decision trees and random forests is statistically significant.  We also see that the performance difference between random forests and naive bayes is **not** statistically significant.
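The significance argument used here is the informal "do the 95% confidence intervals overlap?" check. A minimal sketch of that check with the intervals reported above (the helper `intervals_overlap` is hypothetical, and the check is a heuristic rather than a formal hypothesis test):

```python
def intervals_overlap(a, b):
    """True if the closed intervals a = (low, high) and b = (low, high) overlap."""
    return a[0] <= b[1] and b[0] <= a[1]

# confidence intervals as reported in this notebook
intervals = {
    "decision tree": (0.68, 0.80),
    "random forest": (0.89, 0.96),
    "naive bayes": (0.93, 0.98),
}

# no overlap: the difference is treated as statistically significant
print(intervals_overlap(intervals["decision tree"], intervals["random forest"]))  # False
# overlap: the difference is treated as not statistically significant
print(intervals_overlap(intervals["random forest"], intervals["naive bayes"]))    # True
```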


## Handwritten Digit Classification

Another high-dimensional problem encountered was the handwritten digit classification problem with a 64-dimensional space.

In [10]:
# fetch dataset
digits = pd.read_csv(home+'optdigits.csv', header=None)
digits.columns = ['a'+str(i) for i in range(1,65)] + ['digit']
digits

Unnamed: 0,a1,a2,a3,a4,a5,a6,a7,a8,a9,a10,...,a56,a57,a58,a59,a60,a61,a62,a63,a64,digit
0,0,1,6,15,12,1,0,0,0,7,...,0,0,0,6,14,7,1,0,0,0
1,0,0,10,16,6,0,0,0,0,7,...,0,0,0,10,16,15,3,0,0,0
2,0,0,8,15,16,13,0,0,0,1,...,0,0,0,9,14,0,0,0,0,7
3,0,0,0,3,11,16,0,0,0,0,...,0,0,0,0,1,15,2,0,0,4
4,0,0,5,14,4,0,0,0,0,0,...,0,0,0,4,12,14,7,0,0,6
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5615,0,0,4,10,13,6,0,0,0,1,...,0,0,0,2,14,15,9,0,0,9
5616,0,0,6,16,13,11,1,0,0,0,...,0,0,0,6,16,14,6,0,0,0
5617,0,0,1,11,15,1,0,0,0,0,...,0,0,0,2,9,13,6,0,0,8
5618,0,0,2,10,7,0,0,0,0,0,...,0,0,0,5,12,16,12,0,0,9


In [11]:

# data
X = digits.drop(columns=['digit'])
y = digits[['digit']]

In [12]:
# setting up training/testing data
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y['digit'], # we want a series as target
    train_size=0.8,
    test_size=0.2,
    random_state=1
)

Let's do the rule-of-thumb calculation again to give us the number of estimators with ${\rm n\_features} = 64$,

$$
{\rm n\_estimators} = \frac{2*64}{\sqrt{64}} = 16 \approx 20
$$


In [13]:
# model object
model = RandomForestClassifier(
    n_estimators = 20,     # rule of thumb
    random_state = 0
)

In [14]:
model.fit(X_train, y_train)

In [15]:
dsutils.acc_score(model, X_test, y_test, as_string=True)

'Accuracy: 0.97 (0.96, 0.98)'

In [16]:
dsutils.confusion_matrix(model, X_test, y_test)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9
0,104,0,0,0,1,0,0,0,0,0
1,0,126,1,0,0,0,0,2,0,1
2,0,0,91,0,0,0,0,0,0,0
3,0,0,0,115,0,1,0,0,0,2
4,0,0,0,0,106,0,0,1,0,1
5,0,0,0,1,0,94,0,0,0,1
6,0,0,0,0,0,1,110,0,1,0
7,0,0,0,1,0,0,0,129,0,0
8,0,2,2,1,0,0,3,0,112,1
9,1,0,0,1,1,1,0,0,1,108


**Observation**: Very few mistakes and no digit stands out in terms of mistakes.  Furthermore, the performance increase over decision trees is statistically significant!

**decision tree**: Accuracy: 0.92 (0.90, 0.93)

**random forest**: Accuracy: 0.97 (0.96, 0.98)

The confidence intervals do not overlap!
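For reference, an interval like the one reported for the random forest can be reproduced from the confusion matrix above: 1095 of the 1124 test digits lie on the diagonal. The sketch below assumes the usual normal approximation for a 95% interval; `dsutils.acc_score` may compute its interval differently, and `accuracy_interval` is a hypothetical helper.

```python
import math

def accuracy_interval(n_correct, n_total, z=1.96):
    """Accuracy with a normal-approximation confidence interval (z=1.96 for 95%)."""
    acc = n_correct / n_total
    half_width = z * math.sqrt(acc * (1 - acc) / n_total)
    return acc, acc - half_width, acc + half_width

# counts taken from the random forest confusion matrix for the digits data
acc, lower, upper = accuracy_interval(n_correct=1095, n_total=1124)
print(f"Accuracy: {acc:.2f} ({lower:.2f}, {upper:.2f})")  # Accuracy: 0.97 (0.96, 0.98)
```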

## Takeaway

Random forests demonstrate that **bagging significantly improves the performance** of learners such as decision trees via an ensemble technique based on bootstrap samples and feature bagging.

As we have seen with ANNs, bagging models tend to have too many parameters to do an effective grid search, and just as with ANNs **we rely heavily on rules of thumb to arrive at reasonable models**.
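A quick count shows why a grid search becomes expensive. The grid below is purely hypothetical (the parameter names are real `RandomForestClassifier` parameters, the candidate values are made up), and every one of the resulting fits trains a complete forest.

```python
from itertools import product

# hypothetical search grid over four random forest parameters
param_grid = {
    "n_estimators": [100, 200, 400],
    "max_depth": [None, 10, 20, 40],
    "max_features": ["sqrt", "log2"],
    "min_samples_leaf": [1, 2, 5],
}
cv_folds = 5   # assumed number of cross-validation folds

combinations = list(product(*param_grid.values()))
total_fits = len(combinations) * cv_folds

print(len(combinations))  # 72 parameter combinations
print(total_fits)         # 360 model fits
```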

# Boosting

The kind of boosting we are talking about here is **gradient boosting** where learners act in serial trying to "rectify" the mistakes that the previous learner made.  This gives rise to the following figure.


<center>

<!-- ![figure](https://miro.medium.com/v2/resize:fit:1358/1*4XuD6oRrgVqtaSwH-cu6SA.png) -->

<img src="https://miro.medium.com/v2/resize:fit:1358/1*4XuD6oRrgVqtaSwH-cu6SA.png"  height="300" width="625">


</center>

[source](https://medium.com/@brijeshsoni121272/understanding-boosting-in-machine-learning-a-comprehensive-guide-bdeaa1167a6)

Here we see a gradient boosted model with m stages.  After training and testing the initial stage (steps 1-3), the mistakes of the first stage are incorporated into the training of the second stage (steps 4 and 5).  This pattern is repeated until the last stage provides the overall prediction.  It is considered an ensemble technique because the boosted model consists of many learning models.

The name "gradient boosting" comes from the fact that step 4 can be interpreted as a **gradient descent optimization** of the loss function (a function that describes the errors a learner makes).
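The serial "rectify the mistakes" idea can be sketched in plain Python. This is a toy illustration under simplifying assumptions, not sklearn's implementation: it handles one-dimensional regression with squared-error loss, where the negative gradient of the loss is simply the residual, and each stage is a decision tree of depth one. All function names are made up for this sketch.

```python
def fit_stump(xs, targets):
    """Fit a depth-one regression tree: one threshold, one value per side."""
    best = None
    for threshold in sorted(set(xs))[:-1]:        # keep both sides non-empty
        left = [t for x, t in zip(xs, targets) if x <= threshold]
        right = [t for x, t in zip(xs, targets) if x > threshold]
        left_mean, right_mean = sum(left) / len(left), sum(right) / len(right)
        error = (sum((t - left_mean) ** 2 for t in left)
                 + sum((t - right_mean) ** 2 for t in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    return best[1:]

def stump_predict(stump, x):
    threshold, left_value, right_value = stump
    return left_value if x <= threshold else right_value

def gradient_boost(xs, ys, n_stages, learning_rate=0.1):
    """Return (initial prediction, list of stumps, learning rate)."""
    base = sum(ys) / len(ys)                      # stage 0: predict the mean
    predictions = [base] * len(xs)
    stumps = []
    for _ in range(n_stages):
        # for squared error the negative gradient is the residual y - prediction
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)          # next learner models the mistakes
        stumps.append(stump)
        predictions = [p + learning_rate * stump_predict(stump, x)
                       for p, x in zip(predictions, xs)]
    return base, stumps, learning_rate

def boosted_predict(model, x):
    base, stumps, learning_rate = model
    return base + learning_rate * sum(stump_predict(s, x) for s in stumps)

def mean_squared_error(model, xs, ys):
    return sum((y - boosted_predict(model, x)) ** 2 for x, y in zip(xs, ys)) / len(xs)

xs = [i / 10 for i in range(40)]
ys = [x * x for x in xs]                          # toy target: y = x^2
for n_stages in (0, 5, 50):
    model = gradient_boost(xs, ys, n_stages)
    print(n_stages, round(mean_squared_error(model, xs, ys), 3))
```

The training error shrinks as stages are added because every stage takes a small step (scaled by the learning rate) in the direction that reduces the loss.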

Here we look at gradient boosting as implemented in sklearn.  We study it using the same two examples we used for decision trees.

## Text Classification with Gradient Boosting

In [17]:
# set up train and test sets -- use the docterm matrix from above
X_train, X_test, y_train, y_test = \
  train_test_split(docterm,   # as X
                   newsgroups['label'],  # as y
                   train_size=0.8,
                   test_size=0.2,
                   random_state=2)

In [18]:
# model object
model = GradientBoostingClassifier(
    n_estimators = 400,  # more stages usually improve performance, at the cost of training time
    random_state = 0
)

In [19]:
# train model
model.fit(X_train, y_train)

In [20]:
dsutils.acc_score(model, X_test, y_test, as_string=True)

'Accuracy: 0.91 (0.87, 0.95)'

In [21]:
dsutils.confusion_matrix(model, X_test, y_test)

Unnamed: 0,politics,space
politics,104,5
space,14,85


**Observation**:  Consider the following results,

**decision tree**: Accuracy: 0.74 (0.68,0.80)

**gradient boosting**: Accuracy: 0.91 (0.87,0.95)

**random forest**: Accuracy: 0.92 (0.89, 0.96)

**naive bayes**: Accuracy: 0.96 (0.93,0.98)




## Handwritten Digit Classification with Gradient Boosting

In [22]:
# use the digits data from above
X = digits.drop(columns=['digit'])
y = digits[['digit']]

In [23]:
# setting up training/testing data
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y['digit'], 
    train_size=0.8,
    test_size=0.2,
    random_state=1
)

In [24]:
# model object
model = GradientBoostingClassifier(
    n_estimators = 300,  # more stages usually improve performance, at the cost of training time
    random_state = 0
)

In [25]:
# train the model
model.fit(X_train, y_train)

In [26]:
dsutils.acc_score(model, X_test, y_test, as_string=True)

'Accuracy: 0.99 (0.98, 0.99)'

In [27]:
dsutils.confusion_matrix(model, X_test, y_test)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9
0,105,0,0,0,0,0,0,0,0,0
1,0,128,0,0,0,0,0,1,0,1
2,0,0,91,0,0,0,0,0,0,0
3,0,0,0,116,0,0,0,0,1,1
4,0,0,0,0,106,0,0,1,0,1
5,0,0,0,0,0,95,0,0,0,1
6,0,0,0,0,1,1,110,0,0,0
7,0,0,0,2,1,0,0,127,0,0
8,0,0,0,0,0,0,1,0,119,1
9,1,0,0,0,0,0,0,0,1,111


**Observation**: Very few mistakes. The digit seven is misclassified as three in two instances.

**decision tree**: Accuracy: 0.92 (0.90, 0.93)

**random forest**: Accuracy: 0.97 (0.96, 0.98)

**gradient boosting**: Accuracy: 0.99 (0.98, 0.99)



## Takeaway

The performances of random forests and gradient boosting are comparable.  Both ensemble strategies significantly improve on single decision trees in both of our high-dimensional domains.

Again, boosting models are too complex and take too long to train to implement an effective grid search strategy.  Unfortunately, there are no nice rules of thumb to apply here.  **Perhaps the best approach is to start with the default of 100 'n_estimators' and then increase this value steadily until performance levels off or until training time becomes prohibitive.**
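One way to see where performance levels off without retraining for every value is the `staged_predict` method of sklearn's gradient boosting models, which yields the ensemble's predictions after each stage. The sketch below uses a small synthetic dataset from `make_classification` as a stand-in (an assumption for illustration, not one of the datasets above) and scores each stage on a held-out validation split.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# synthetic stand-in data, split into training and validation parts
X_demo, y_demo = make_classification(n_samples=400, n_features=20, random_state=0)
X_fit, X_val, y_fit, y_val = train_test_split(
    X_demo, y_demo, train_size=0.8, test_size=0.2, random_state=0)

model = GradientBoostingClassifier(n_estimators=200, random_state=0)
model.fit(X_fit, y_fit)

# validation accuracy after 1, 2, ..., 200 stages from a single trained model
stage_scores = [accuracy_score(y_val, y_pred)
                for y_pred in model.staged_predict(X_val)]
best_n = int(np.argmax(stage_scores)) + 1

print(f"best validation accuracy {max(stage_scores):.2f} at n_estimators = {best_n}")
```

If the best score occurs close to the largest number of stages tried, that suggests training with a larger 'n_estimators' and repeating the check.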

# Ensemble Techniques and Regression

All the ensemble models discussed here are also available as regression models in sklearn: **RandomForestRegressor** and **GradientBoostingRegressor**.
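A minimal sketch of both regressors, again on synthetic stand-in data generated with `make_regression` (an assumption for illustration). Both models are used with their library defaults, and `score` returns the $R^2$ value on the test set.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.model_selection import train_test_split

# synthetic regression data: 10 features, 3 of them informative
X_reg, y_reg = make_regression(n_samples=400, n_features=10, n_informative=3,
                               noise=10.0, random_state=0)
Xr_train, Xr_test, yr_train, yr_test = train_test_split(
    X_reg, y_reg, train_size=0.8, test_size=0.2, random_state=0)

scores = {}
for regressor in (RandomForestRegressor(random_state=0),
                  GradientBoostingRegressor(random_state=0)):
    regressor.fit(Xr_train, yr_train)
    scores[type(regressor).__name__] = regressor.score(Xr_test, yr_test)  # R^2

for name, r2 in scores.items():
    print(f"{name}: R^2 = {r2:.2f}")
```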