# Introduction to Ensemble Learning Methods

*By: Robert Jackson*

No single model performs best across all types of classification or regression problems (cf. the No Free Lunch Theorem): a model's performance varies with the task. This also means that, for any given observation, different models are not guaranteed to give us identical answers. Because of this, it is sometimes better to use an ENSEMBLE METHOD.

* A group of models is referred to as an ENSEMBLE (Hence ENSEMBLE Learning)
* Can be used for classification or regression tasks

One example is the use of multiple Decision Trees (an ensemble known as a Random Forest). In this example we'll use Decision Tree Classifiers:
1. Train n-number of Decision Trees on RANDOM subsets of the training data
2. We obtain n-number of predictions
3. The class with the highest frequency of 'votes' (model predictions) becomes the Ensemble's prediction
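
The three steps above can be sketched by hand as follows. This is a minimal illustration, not scikit-learn's Random Forest implementation: it assumes the "random subsets" are drawn with replacement, uses 25 trees, and uses a 70/30 train/test split chosen only for this example. A real Random Forest additionally randomizes the features considered at each split.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

rng = np.random.default_rng(0)
n_trees = 25  # illustrative choice

# Step 1: train each tree on a random subset of the training rows
# (assumption: rows are drawn with replacement)
trees = []
for _ in range(n_trees):
    idx = rng.integers(0, len(X_train), size=len(X_train))
    trees.append(DecisionTreeClassifier(random_state=0).fit(X_train[idx], y_train[idx]))

# Step 2: collect one prediction per tree -> shape (n_trees, n_test_samples)
all_preds = np.array([t.predict(X_test) for t in trees])

# Step 3: count the votes per class for every test sample and keep the winner
n_classes = len(np.unique(y))
votes = np.stack([(all_preds == c).sum(axis=0) for c in range(n_classes)])
ensemble_pred = votes.argmax(axis=0)

accuracy = (ensemble_pred == y_test).mean()
print("Ensemble accuracy:", accuracy)
```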

Ensembles are not restricted to Decision Trees, and can be composed of a variety of classification and regression models.

Also called: Committee-learning, classifier combination, multiple classifier systems

## Voting Classifier

When we want to combine an assortment of different classifiers, we use a VOTING classifier.

In [19]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split, GridSearchCV

from sklearn.ensemble import RandomForestClassifier
from sklearn import tree
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression 
from sklearn.svm import SVC

from sklearn import datasets

import warnings
warnings.simplefilter('ignore')

In [20]:
iris = datasets.load_iris()

In [21]:
print(iris.DESCR)

.. _iris_dataset:

Iris plants dataset
--------------------

**Data Set Characteristics:**

    :Number of Instances: 150 (50 in each of three classes)
    :Number of Attributes: 4 numeric, predictive attributes and the class
    :Attribute Information:
        - sepal length in cm
        - sepal width in cm
        - petal length in cm
        - petal width in cm
        - class:
                - Iris-Setosa
                - Iris-Versicolour
                - Iris-Virginica
                
    :Summary Statistics:

                    Min  Max   Mean    SD   Class Correlation
    sepal length:   4.3  7.9   5.84   0.83    0.7826
    sepal width:    2.0  4.4   3.05   0.43   -0.4194
    petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)
    petal width:    0.1  2.5   1.20   0.76    0.9565  (high!)

    :Missing Attribute Values: None
    :Class Distribution: 33.3% for each of 3 classes.
    :Creator: R.A. Fisher
    :Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
    :

In [22]:
iris.target

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])

In [23]:
iris.feature_names

['sepal length (cm)',
 'sepal width (cm)',
 'petal length (cm)',
 'petal width (cm)']

In [24]:
iris_df = pd.DataFrame(iris.data, columns = iris.feature_names)
iris_df['Class'] = iris.target

In [25]:
iris_df.head()

| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | Class |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | 0 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | 0 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | 0 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | 0 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | 0 |


In [26]:
x = iris_df.drop(['Class'], axis=1)
y = iris_df['Class']

In [27]:
xTrain, xTest, yTrain, yTest = train_test_split(x, y, test_size = 0.7, random_state = 0)

In [28]:
logReg = LogisticRegression()
decTree = tree.DecisionTreeClassifier()
svm = SVC(probability = True)

In [29]:
voter_hard = VotingClassifier(
                estimators=[('lr', logReg), ('dt', decTree), ('svc', svm)],
                voting='hard'
                )

voter_soft = VotingClassifier(
                estimators=[('lr', logReg), ('dt', decTree), ('svc', svm)],
                voting='soft'
                )

In [30]:
voter_hard.voting

'hard'

In [31]:
from sklearn.metrics import accuracy_score

for clf in (logReg, decTree, svm, voter_hard, voter_soft):
    type_ = ''
    clf.fit(xTrain, yTrain)
    y_pred = clf.predict(xTest)
    if clf.__class__.__name__ == 'VotingClassifier':
        if clf.voting =='hard':
            type_ = ' (Hard)'
        else:
            type_ = ' (Soft)'
    else:
        pass
    print(clf.__class__.__name__ + type_ + ':', accuracy_score(yTest, y_pred))

LogisticRegression: 0.9333333333333333
DecisionTreeClassifier: 0.9428571428571428
SVC: 0.9428571428571428
VotingClassifier (Hard): 0.9619047619047619
VotingClassifier (Soft): 0.9523809523809523


#### Types of Voters
    
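   [Hard Voters](https://drive.google.com/file/d/1RSzUhN9o4qq-wnwmgv3RyJMy8XRu6CBx/view?usp=sharing) determine the class by frequency: each classifier casts one vote for its predicted class, and the class with the most votes wins.

   [Soft Voters](https://drive.google.com/file/d/1ScGkjxgM4IQ6ds8tBjFLAGQl6UHU4cox/view?usp=sharing) determine the class by averaged (optionally weighted) probability: the class with the highest mean predicted probability wins. This requires every classifier to expose `predict_proba`, which is why `SVC` is created with `probability = True` above.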
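
To make the difference concrete, here is a small sketch with made-up probabilities for a single observation. The three classifiers and their numbers are hypothetical; they are chosen so that the two voting schemes disagree.

```python
import numpy as np

# Hypothetical predicted probabilities from three classifiers for ONE observation
# (columns: class 0, class 1). The numbers are invented for illustration.
probas = np.array([
    [0.45, 0.55],  # classifier A leans slightly towards class 1
    [0.40, 0.60],  # classifier B leans slightly towards class 1
    [0.95, 0.05],  # classifier C is very confident in class 0
])

# Hard voting: every classifier casts one vote for its most likely class
hard_votes = probas.argmax(axis=1)                            # votes: 1, 1, 0
hard_winner = np.bincount(hard_votes, minlength=2).argmax()   # class 1 wins 2 to 1

# Soft voting: average the probabilities, then pick the most probable class
mean_proba = probas.mean(axis=0)                              # [0.6, 0.4]
soft_winner = mean_proba.argmax()                             # class 0 wins

print("hard:", hard_winner, "soft:", soft_winner)  # hard: 1 soft: 0
```

Soft voting lets one very confident classifier outweigh two hesitant ones, which only helps when the predicted probabilities are reasonably well calibrated.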


   



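#### Why does ensemble learning often work at least as well as its single best model?
    
   By combining multiple models, whether different algorithms or the same algorithm trained on random subsets of the data, the errors of the individual models tend to cancel out as long as they are not strongly correlated, so the ensemble generalizes better. This is a tendency, not a guarantee. The caveat is that ensembles generate knowledge that isn't always understandable to the user.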


   



In [32]:
#svm.predict_proba(x)

## Bagging

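You should all be familiar with the idea of cross-validation and K-fold. With K-fold, we are sampling WITHOUT replacement: the data is partitioned, so each observation appears in exactly one fold. When we instead sample WITH replacement, the same observation can be drawn more than once and some observations are left out of a given sample; such a sample is called a BOOTSTRAP sample. Training each model of an ensemble on its own bootstrap sample and combining their predictions is referred to as BAGGING (short for BOOTSTRAP AGGREGATING). The observations left out of a model's sample are its out-of-bag (OOB) observations, which is what `oob_score = True` uses for evaluation below.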
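
The contrast between the two sampling schemes can be checked directly on row indices. The sketch below uses 1000 hypothetical rows and 5 folds, both arbitrary choices for illustration.

```python
import numpy as np
from sklearn.model_selection import KFold

n_samples = 1000  # hypothetical dataset size
indices = np.arange(n_samples)
rng = np.random.default_rng(0)

# K-fold (sampling WITHOUT replacement): the validation folds partition the data,
# so every row appears in exactly one validation fold.
kfold = KFold(n_splits=5, shuffle=True, random_state=0)
all_val = np.concatenate([val_idx for _, val_idx in kfold.split(indices)])

# Bootstrap (sampling WITH replacement): some rows repeat, others are never drawn.
boot_idx = rng.choice(indices, size=n_samples, replace=True)
unique_frac = len(np.unique(boot_idx)) / n_samples

# Rows that were never drawn are the out-of-bag rows for this sample.
oob_idx = np.setdiff1d(indices, boot_idx)

print("Fraction of distinct rows in the bootstrap sample:", unique_frac)
print("Number of out-of-bag rows:", len(oob_idx))
```

For large samples, a bootstrap sample contains on average about 63% of the distinct rows (1 - 1/e), leaving roughly 37% out-of-bag.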

In [33]:
from sklearn.ensemble import BaggingClassifier

In [34]:
bagger = BaggingClassifier(
            LogisticRegression(), n_estimators=100,
            max_samples=45, bootstrap=True, oob_score = True,
            n_jobs=-1
            )

bagger1 = BaggingClassifier(
            LogisticRegression(), n_estimators=1,
            max_samples=7, bootstrap=True, oob_score = True,
            n_jobs=-1
            )

In [35]:
bagger.fit(xTrain, yTrain)
bagger1.fit(xTrain, yTrain)

BaggingClassifier(base_estimator=LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='warn',
          tol=0.0001, verbose=0, warm_start=False),
         bootstrap=True, bootstrap_features=False, max_features=1.0,
         max_samples=7, n_estimators=1, n_jobs=-1, oob_score=True,
         random_state=None, verbose=0, warm_start=False)

In [36]:
bagg_pred = bagger.predict(xTest)
bagg1_pred = bagger1.predict(xTest)

In [37]:
accuracy_score(yTest, bagg_pred)

0.9333333333333333

In [38]:
accuracy_score(yTest, bagg1_pred)

0.6285714285714286

In [39]:
bagger.oob_score_

0.9555555555555556

## Random Forests

As mentioned at the beginning of this notebook, a RANDOM FOREST is an ensemble of Decision Trees. 

*Note: RandomForestClassifier() has MOST, but not all of the hyperparameters of DecisionTreeClassifier()*

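Random Forests don't pick the best predictor to split on from all predictors; at each split, they pick the best from a random subset of predictors.

Why is this important? Restricting each split to a random subset of predictors makes the trees less correlated with one another, and averaging less-correlated trees reduces the variance of the ensemble.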
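
As a rough sketch of what that per-split feature subset changes, the snippet below compares plain bagged decision trees (every split may use all features) with a random forest limited to `max_features="sqrt"`. The 5-fold cross-validation and the fixed `random_state` values are assumptions made for this example. On a dataset as small and easy as iris the two scores may be nearly identical; the point is the difference in configuration.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Bagged trees: bootstrap samples of the rows, but every split sees all 4 features
bagged_trees = BaggingClassifier(
    DecisionTreeClassifier(random_state=0), n_estimators=100, random_state=0
)

# Random forest: bootstrap samples of the rows AND, at each split,
# only a random subset of the features (sqrt(4) = 2 here) is considered
forest = RandomForestClassifier(
    n_estimators=100, max_features="sqrt", random_state=0
)

bagged_score = cross_val_score(bagged_trees, X, y, cv=5).mean()
forest_score = cross_val_score(forest, X, y, cv=5).mean()

print("Bagged trees  :", bagged_score)
print("Random forest :", forest_score)
```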


In [40]:
forrest_gump = RandomForestClassifier(
                n_estimators=100, 
                max_leaf_nodes=16,
                n_jobs=-1
                )




forrest_gump.fit(xTrain, yTrain)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=16,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=-1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

In [41]:
forrest_pred = forrest_gump.predict(xTest)

In [42]:
accuracy_score(yTest, forrest_pred)

0.9619047619047619

## GridSearchCV (Grid Search Cross-Validation)

In [43]:
param_grid = {
    
    "criterion": ["gini", "entropy"],
    "splitter": ['best', 'random'],
    "max_depth": [depth for depth in range(2,20)],
    "min_samples_split": [minimum for minimum in range(2,20)]
    #'max_leaf_nodes': [leaves for leaves in range(10,20)]
    
    }


decTree2 = tree.DecisionTreeClassifier()

In [44]:
Grid_Tree = GridSearchCV(decTree2, param_grid, cv=5)

In [45]:
Grid_Tree.fit(xTrain, yTrain)

GridSearchCV(cv=5, error_score='raise-deprecating',
       estimator=DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best'),
       fit_params=None, iid='warn', n_jobs=None,
       param_grid={'criterion': ['gini', 'entropy'], 'splitter': ['best', 'random'], 'max_depth': [2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19], 'min_samples_split': [2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)

In [46]:
Grid_Tree.best_params_

{'criterion': 'gini',
 'max_depth': 2,
 'min_samples_split': 12,
 'splitter': 'random'}
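
Rather than copying the best parameters into a new estimator by hand, the fitted search object can be used directly. The following is a self-contained sketch with a deliberately smaller grid, a 70/30 split, and a fixed `random_state` (all assumptions for this example, so the numbers will not match the cells above). With the default `refit=True`, `best_estimator_` is already retrained on the full training set using `best_params_`.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Smaller grid than in the notebook, to keep the example quick
param_grid = {
    "criterion": ["gini", "entropy"],
    "max_depth": [2, 3, 4, 5],
    "min_samples_split": [2, 5, 10],
}

# A fixed random_state makes the search repeatable from run to run
search = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid, cv=5)
search.fit(X_train, y_train)

# best_estimator_ carries the winning parameters, so nothing is retyped by hand
best_tree = search.best_estimator_
test_accuracy = best_tree.score(X_test, y_test)

print(search.best_params_)
print("Test accuracy:", test_accuracy)
```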

In [47]:
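# Note: these values differ from the best_params_ printed above. With splitter='random'
# and no random_state, the grid search result can change from run to run.
BestDecTree = tree.DecisionTreeClassifier(criterion = 'gini', max_depth = 8, min_samples_split = 14, splitter = 'random')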
BestDecTree.fit(xTrain, yTrain)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=8,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=14,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='random')

In [48]:
Best_Pred = BestDecTree.predict(xTest)

In [49]:
accuracy_score(yTest, Best_Pred)

0.9142857142857143

# Additional Resources
Sources on Ensemble Methods

* (Academic Summary of Ensemble Learning) https://cs.nju.edu.cn/zhouzh/zhouzh.files/publication/springerEBR09.pdf

* (Variation - Stochastic Gradient Descent) https://scikit-learn.org/stable/modules/sgd.html

* https://www.statsmodels.org/devel/generated/statsmodels.discrete.discrete_model.Logit.html#statsmodels.discrete.discrete_model.Logit

## Example of stacking:

In [51]:
# Note: load_boston was removed in scikit-learn 1.2, so this cell needs an older version
boston = datasets.load_boston()

In [53]:
bos_df = pd.DataFrame(boston.data, columns=boston.feature_names)
bos_df["Price"] = boston.target

In [54]:
bos_df.head()

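| | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | Price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 4.09 | 1.0 | 296.0 | 15.3 | 396.9 | 4.98 | 24.0 |
| 1 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2.0 | 242.0 | 17.8 | 396.9 | 9.14 | 21.6 |
| 2 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2.0 | 242.0 | 17.8 | 392.83 | 4.03 | 34.7 |
| 3 | 0.03237 | 0.0 | 2.18 | 0.0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3.0 | 222.0 | 18.7 | 394.63 | 2.94 | 33.4 |
| 4 | 0.06905 | 0.0 | 2.18 | 0.0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3.0 | 222.0 | 18.7 | 396.9 | 5.33 | 36.2 |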


In [55]:
from sklearn.linear_model import LinearRegression

In [58]:
x = bos_df.drop(["Price"], axis=1)
y = bos_df["Price"]

xTrain1, xTest1, yTrain1, yTest1 = train_test_split(x, y, test_size=.3, random_state=2)
xTrain2, xTest2, yTrain2, yTest2 = train_test_split(x, y, test_size=.3, random_state=3)
xTrain3, xTest3, yTrain3, yTest3 = train_test_split(x, y, test_size=.3, random_state=4)
xTrain4, xTest4, yTrain4, yTest4 = train_test_split(x, y, test_size=.3, random_state=5)

In [59]:
linreg1 = LinearRegression()
linreg2 = LinearRegression()
linreg3 = LinearRegression()

In [60]:
linreg1.fit(xTrain1, yTrain1)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False)

In [61]:
linreg2.fit(xTrain2, yTrain2)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False)

In [62]:
linreg3.fit(xTrain3, yTrain3)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False)

In [64]:
# simply taking the average of each linear predictor
stack_predictions = (linreg1.predict(xTest1) +
                     linreg2.predict(xTest2) + 
                     linreg3.predict(xTest3)) / 3

In [65]:
from sklearn.metrics import mean_squared_error

In [66]:
np.sqrt(mean_squared_error(yTest4, stack_predictions))

11.214521158108445

In [67]:
# compared to each single prediction
pred1 = linreg1.predict(xTest1)
pred2 = linreg2.predict(xTest2)
pred3 = linreg3.predict(xTest3)
print(np.sqrt(mean_squared_error(yTest4, pred1)))
print(np.sqrt(mean_squared_error(yTest4, pred2)))
print(np.sqrt(mean_squared_error(yTest4, pred3)))

13.011058900816131
12.445519571487386
13.291414904805892


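The averaged prediction scores a lower RMSE here than any individual predictor. Treat the comparison with caution, though: each model predicts on its own test split (`xTest1` to `xTest3`), while every score is computed against `yTest4`, so the predictions and the targets do not refer to the same rows. A sound comparison would evaluate all models on one shared held-out set. Note also that averaging predictions is the simplest way to combine models; stacking proper trains a second-level model on the base models' predictions.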
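
A minimal sketch of the same averaging idea, evaluated on a single shared test set, could look like this. It uses synthetic data from `make_regression` as a stand-in (an assumption, since the Boston dataset is unavailable in recent scikit-learn versions), and it trains each linear model on a different bootstrap sample of the training rows so that the models differ.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): 500 rows, 13 features, noisy target
X, y = make_regression(n_samples=500, n_features=13, noise=20.0, random_state=0)

# One held-out test set shared by every model
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Train each model on a different bootstrap sample of the training rows
rng = np.random.default_rng(0)
models = []
for _ in range(3):
    idx = rng.integers(0, len(X_train), size=len(X_train))
    models.append(LinearRegression().fit(X_train[idx], y_train[idx]))


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return float(np.sqrt(mean_squared_error(y_true, y_pred)))


# All predictions are for the SAME rows, so averaging them is meaningful
preds = np.array([m.predict(X_test) for m in models])
individual_rmse = [rmse(y_test, p) for p in preds]
averaged_rmse = rmse(y_test, preds.mean(axis=0))

print("Individual RMSE:", individual_rmse)
print("Averaged RMSE  :", averaged_rmse)
```

When all models are scored on the same rows, the averaged prediction's RMSE can never exceed the worst individual RMSE, but it is not guaranteed to beat the best one. For stacking with a trained second-level model, scikit-learn provides `sklearn.ensemble.StackingRegressor`.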