# COSC 411: Artificial Intelligence

Instructor: Dr. Shuangquan (Peter) Wang

Email: spwang@salisbury.edu

Department of Computer Science, Salisbury University


# Module 3_ML algorithms and application

## 5. Ensemble Methods



**Contents of this note refer to**
1) https://colab.research.google.com/github/Eldave93/Seizure-Detection-Tutorials/blob/master/05.%20Ensemble%20Learning.ipynb#:~:text=Ensemble%20methods%20aim%20to%20improve,general%20methods%2C%20averaging%20and%20boosting
2) Data Science Complete Tutorial. https://github.com/edyoda/data-science-complete-tutorial/blob/master/16.%20Ensemble%20Methods.ipynb
3) Python tutorial: https://docs.python.org/3/tutorial/

**<font color=red>All rights reserved. Dissemination or sale of any part of this note is NOT permitted.</font>**

# Ensemble Methods
* The objective of ensemble methods is to combine the predictions of several base estimators (Linear Regression, Decision Tree, etc.) to create a combined, more generalized model.
* Two types of ensemble methods
  - Averaging Method : Build several estimators independently and average their predictions. An example is RandomForest.
  - Boosting Method : Base estimators are built sequentially on reweighted versions of the data, so that each new model focuses on the samples that earlier models misclassified. An example is AdaBoost.
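
To make the averaging idea concrete, here is a minimal sketch that is not part of the original notebook. It simulates independent, noisy estimators with NumPy and compares the variance of one estimator with the variance of their average. The Gaussian noise model and the names `true_value`, `n_estimators`, and `n_trials` are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

true_value = 1.0      # quantity every estimator tries to predict (illustrative)
n_estimators = 25     # number of independent base estimators (illustrative)
n_trials = 10_000     # repeat the experiment many times to measure variance

# Each row is one trial; each column is one base estimator's noisy prediction.
predictions = true_value + rng.normal(0.0, 1.0, size=(n_trials, n_estimators))

single_var = predictions[:, 0].var()            # variance of a single estimator
averaged_var = predictions.mean(axis=1).var()   # variance of the averaged ensemble

print(f"single estimator variance:  {single_var:.3f}")
print(f"averaged ensemble variance: {averaged_var:.3f}")
```

With fully independent estimators the variance shrinks roughly in proportion to the number of estimators. Real base estimators trained on overlapping data are correlated, so the gain in practice is smaller.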
  
<img src="https://cdn-images-1.medium.com/max/1000/1*PaXJ8HCYE9r2MgiZ32TQ2A.png">

## Bagging
A bagging classifier is an ensemble of base classifiers, each fit on a random subset of a dataset. Their predictions are then pooled or aggregated to form a final prediction. This reduces the variance of an estimator and can be a simple way to reduce overfitting. Bagging works best with complex models, as opposed to boosting, which works best with weak models<sup>1</sup>.

Specifically, bagging is when sampling is done with replacement<sup>2</sup>; sampling without replacement is called pasting<sup>3</sup>. Both bagging and pasting allow a training instance to be sampled several times across multiple predictors, but only bagging allows an instance to be sampled several times for the same predictor<sup>4</sup>.
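
The difference between the two sampling schemes can be sketched with NumPy alone. This is an illustrative example, not code from the original notebook; the toy sizes `n_samples` and `subset_size` are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

n_samples = 10                    # size of a toy training set (illustrative)
subset_size = 6                   # rows drawn for one predictor (illustrative)
indices = np.arange(n_samples)

# Bagging: draw with replacement, so an index may repeat within one predictor's sample.
bagging_sample = rng.choice(indices, size=subset_size, replace=True)

# Pasting: draw without replacement, so every index in the sample is unique.
pasting_sample = rng.choice(indices, size=subset_size, replace=False)

print("bagging:", np.sort(bagging_sample))
print("pasting:", np.sort(pasting_sample))
```

Drawing a fresh sample for every predictor means the same row can still appear in several predictors' samples under either scheme.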

We can do this with any classifier, so let's start with a support vector machine.

**NOTE**
- If we wanted to use pasting we would just set *bootstrap=False*.

---

1. https://scikit-learn.org/stable/modules/ensemble.html
2. Breiman, L. (1996). Bagging predictors. Machine learning, 24(2), 123-140.
3. Breiman, L. (1999). Pasting small votes for classification in large databases and on-line. Machine learning, 36(1-2), 85-103.
4. Géron, A. (2017). Hands-on machine learning with Scikit-Learn and TensorFlow: concepts, tools, and techniques to build intelligent systems. O'Reilly Media, Inc.

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.ensemble import BaggingClassifier
from sklearn.decomposition import PCA
from sklearn.metrics import classification_report

pipe_svc = Pipeline([('scl', StandardScaler()),
                     ('pca', PCA(n_components=0.8, random_state = 0)),
                     ('clf', SVC(kernel='rbf', random_state=0))])

# `estimator` is the parameter name in scikit-learn >= 1.2 (older releases call it `base_estimator`)
bag = BaggingClassifier(estimator=pipe_svc,
                        n_estimators=10,
                        max_samples=0.5,
                        max_features=0.5,
                        bootstrap=True,
                        bootstrap_features=True,
                        oob_score=True,
                        warm_start=False,
                        n_jobs=-1,
                        random_state=0)



In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
import pandas as pd
import numpy as np

dataset = load_iris()
#df = pd.DataFrame(dataset.data, columns=dataset.feature_names)
X = dataset['data']
y = dataset['target']

X_train, X_test, y_train, y_test = \
train_test_split(X, y, test_size=0.3, random_state=0)

bag.fit(X_train, y_train)

In [None]:
y_pred = bag.predict(X_test)

display(pd.DataFrame(classification_report(y_test, y_pred, output_dict=True)))

As we can see, performance is okay but recall is particularly poor. It is likely the model is not complex enough or models in the ensemble are too similar to each other (we'll look at solving this soon).

An additional way to get a performance metric on a validation set is to set `oob_score=True`.

By default, each estimator in a bagging ensemble trains on only a sample of the training data, which leaves the remaining training instances as out-of-bag (oob) instances for that estimator. Since the estimator does not see them during training, we can evaluate on them without a separate validation set using `oob_score_`.

It gets an accuracy of about 0.94 on this out-of-bag validation set, so it is pretty close to what we got above.

In [None]:
bag.oob_score_
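
To see where the out-of-bag instances come from, the sketch below counts how many training indices are never drawn in one bootstrap sample. It is an illustration only, not how scikit-learn computes `oob_score_`; the helper `oob_fraction` and the size `n_samples` are hypothetical names introduced here.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 1000   # size of a toy training set (illustrative)

def oob_fraction(sample_size):
    """Fraction of training indices never drawn in one bootstrap sample."""
    drawn = rng.integers(0, n_samples, size=sample_size)   # sampling with replacement
    in_bag = np.zeros(n_samples, dtype=bool)
    in_bag[drawn] = True
    return 1.0 - in_bag.mean()

full = oob_fraction(n_samples)         # comparable to max_samples=1.0
half = oob_fraction(n_samples // 2)    # comparable to max_samples=0.5

print(f"oob fraction, full-size sample: {full:.3f}")
print(f"oob fraction, half-size sample: {half:.3f}")
```

For large training sets these fractions approach 1/e (about 37%) for a full-size bootstrap sample and e^(-0.5) (about 61%) for a half-size one, so every estimator leaves a substantial share of the rows unseen.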

## RandomForest
* A limitation of a decision tree is that it overfits and shows high variance.
* RandomForest is an averaging ensemble method whose prediction is a function of the predictions of 'n' decision trees.

<img src="https://www.researchgate.net/profile/Stavros_Dimitriadis/publication/324517994/figure/fig1/AS:615965951799303@1523869135381/Classification-process-based-on-the-Random-Forest-algorithm-2.png">

##### Algorithm
* The data consist of R rows & M features.
* A sample of the training data is taken.
* A random set of features is selected.
* The configured number of trees is created by repeating the above two steps.
* For classification, the final prediction is the majority prediction.
* For regression, the final prediction is the mean/median of the individual tree predictions.
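
The steps above can be sketched by hand with plain decision trees. This is a simplified illustration, not scikit-learn's `RandomForestClassifier` implementation. It assumes bootstrap row sampling and lets each tree consider sqrt(M) random features at every split (via `max_features="sqrt"`), which is one common way to do the feature-sampling step; `n_trees` and the other variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

rng = np.random.default_rng(0)
n_trees = 25   # illustrative choice
trees = []
for i in range(n_trees):
    # Sample rows with replacement (bootstrap sample, same size as the training set).
    rows = rng.integers(0, len(X_tr), size=len(X_tr))
    # Random feature selection: sqrt(M) candidate features are drawn at each split.
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=i)
    trees.append(tree.fit(X_tr[rows], y_tr[rows]))

# Majority vote: count the votes per class for every test row.
votes = np.stack([t.predict(X_te) for t in trees])   # shape (n_trees, n_test)
forest_pred = np.array([np.bincount(col, minlength=10).argmax() for col in votes.T])

forest_acc = (forest_pred == y_te).mean()
single_acc = trees[0].score(X_te, y_te)
print(f"one tree: {single_acc:.3f}, hand-rolled forest: {forest_acc:.3f}")
```

For regression, the vote would be replaced by the mean of the individual tree predictions.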

##### Comparing Decision Tree & Random Forest on the scikit-learn digits data (a small MNIST-like dataset)

In [None]:
from sklearn.datasets import load_digits
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

In [None]:
digits = load_digits()

In [None]:
X = digits.data
y = digits.target

In [None]:
trainX, testX, trainY, testY = train_test_split(X,y)

In [None]:
dt = DecisionTreeClassifier()

In [None]:
dt.fit(trainX,trainY)

In [None]:
dt.score(testX,testY)

In [None]:
rf = RandomForestClassifier()

In [None]:
rf.fit(trainX,trainY)

In [None]:
rf.score(testX,testY)

##### Important Hyper-parameters
https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
* n_estimators : number of trees in the forest. More trees generally give better results, but at a higher compute cost.
* max_features : maximum number of features considered when splitting a node. The default for classification is sqrt(n_features); for regression the default is n_features.
* n_jobs : set to -1 to use all available processors.
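
As a rough illustration of these hyper-parameters, the sketch below trains forests with different `n_estimators` values on the digits data and prints the test accuracy of each. The particular values (1, 10, 100), the split, and the name `scores` are assumptions made for this example; exact numbers depend on the split and random seed.

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

scores = {}
for n in (1, 10, 100):
    # max_features="sqrt" is the classification default, written out for clarity;
    # n_jobs=-1 asks scikit-learn to use all available processors.
    rf_n = RandomForestClassifier(n_estimators=n, max_features="sqrt",
                                  n_jobs=-1, random_state=0)
    scores[n] = rf_n.fit(X_tr, y_tr).score(X_te, y_te)
    print(f"n_estimators={n:>3}: test accuracy = {scores[n]:.3f}")
```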

#### Advantages
* Minimal data cleaning or dealing with missing values required.
* Works well with high dimensional datasets
* Reduces variance by averaging the predictions of many trees
* RandomForest can tell importance of features. We can find important features & use them in model training

In [None]:
rf.feature_importances_
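
The raw importance array is easier to read once it is ranked. The following sketch, which is an illustrative addition, refits a forest on the digits data and lists the five most important pixels. The names `forest` and `top5` are hypothetical, and the ranking depends on the random seed.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier

X, y = load_digits(return_X_y=True)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

importances = forest.feature_importances_      # one value per pixel; values sum to 1
top5 = np.argsort(importances)[::-1][:5]       # indices of the 5 most important pixels

for idx in top5:
    row, col = divmod(idx, 8)                  # the digits images are 8x8
    print(f"pixel ({row}, {col}): importance = {importances[idx]:.4f}")
```

A feature subset chosen this way can then be used to train a smaller model, as the last advantage above suggests.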

## AdaBoost
* Boosting in general is about building a model from the training data, then creating a second model that attempts to correct the errors of the first model. Models are added until the training set is predicted perfectly or a maximum number of models has been added.

##### Algorithm
* The core concept of AdaBoost is to fit weak learners (like decision trees) sequentially on repeatedly reweighted versions of the data.
* Initially, each training sample is assigned an equal weight.
* A base estimator is fitted on this weighted data.
* The weights of misclassified samples are increased and the weights of correctly classified samples are decreased.
* Repeat the above two steps until all samples are correctly classified or the configured maximum number of iterations is reached.
* Making predictions : the predictions from all the estimators are then combined through a weighted majority vote (or sum) to produce the final prediction.
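
The steps above can be sketched as a small hand-written loop. This is a minimal version of the classic two-class (discrete) AdaBoost with labels recoded as -1/+1, shown for illustration; scikit-learn's `AdaBoostClassifier` uses a multi-class variant, so its internals differ. The breast cancer dataset, `n_rounds`, and the variable names are assumptions made for this example.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

X, y01 = load_breast_cancer(return_X_y=True)
y = np.where(y01 == 1, 1, -1)            # relabel the two classes as -1 / +1

n_rounds = 20                            # illustrative number of boosting rounds
w = np.full(len(y), 1.0 / len(y))        # start with equal weights
stumps, alphas = [], []

for m in range(n_rounds):
    # Fit a weak learner (a depth-1 tree) on the weighted data.
    stump = DecisionTreeClassifier(max_depth=1, random_state=0)
    stump.fit(X, y, sample_weight=w)
    pred = stump.predict(X)

    # Weighted error of this learner and the strength of its vote.
    err = np.clip(np.sum(w[pred != y]), 1e-10, 1 - 1e-10)
    alpha = 0.5 * np.log((1 - err) / err)

    # Raise the weights of misclassified samples, lower the rest, then renormalize.
    w = w * np.exp(-alpha * y * pred)
    w = w / w.sum()

    stumps.append(stump)
    alphas.append(alpha)

# Final prediction: sign of the alpha-weighted vote of all weak learners.
score = sum(a * s.predict(X) for a, s in zip(alphas, stumps))
ensemble_pred = np.where(score >= 0, 1, -1)

ensemble_acc = (ensemble_pred == y).mean()
print(f"training accuracy after {n_rounds} rounds: {ensemble_acc:.3f}")
```

Note that this reports training accuracy only; a held-out split would be needed to judge generalization.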

In [None]:
from sklearn.ensemble import AdaBoostClassifier

In [None]:
# `estimator` is the parameter name in scikit-learn >= 1.2 (older releases call it `base_estimator`)
ab = AdaBoostClassifier(estimator=DecisionTreeClassifier(max_depth=8), n_estimators=600)

In [None]:
ab.fit(trainX,trainY)

In [None]:
ab.score(testX,testY)

In [None]:
# `estimator` is the parameter name in scikit-learn >= 1.2 (older releases call it `base_estimator`)
ab = AdaBoostClassifier(estimator=RandomForestClassifier(n_estimators=20), n_estimators=600)

In [None]:
ab.fit(trainX,trainY)

In [None]:
ab.score(testX,testY)