# Ensemble methods


# Bagging and random forests

See the slides for details.  We can try a demo on the Labeled Faces in the Wild (LFW) dataset:

In [2]:
from sklearn.datasets import fetch_lfw_people
from sklearn.ensemble import RandomForestClassifier

# Keep only the people who have at least 100 images in the dataset
faces = fetch_lfw_people(min_faces_per_person=100)

In [3]:
from sklearn.model_selection import train_test_split

# Hold out 10% of the images for testing
face_f_train, face_f_test, face_l_train, face_l_test = train_test_split(
    faces['data'], faces['target'], test_size=0.1, random_state=4100)
# No random_state on the forest, so the score varies slightly between runs
forest = RandomForestClassifier(n_estimators=200).fit(face_f_train, face_l_train)
forest.score(face_f_test, face_l_test)

0.7017543859649122

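We can ask which features were most important.  In scikit-learn, a feature's importance is the total reduction in impurity from the splits that use that feature, averaged over the trees and normalized so that the importances sum to 1.  (By default scikit-learn measures impurity with the Gini index rather than entropy; the two behave similarly.)  High numbers indicate important features.  For faces, it's hard to imagine any particular pixel being very important, and the first 20 importances are indeed all small.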

In [4]:
for importance in forest.feature_importances_[:20]:
    print(importance)

0.001637109720292415
0.002261037570677264
0.001297770846192008
0.00048482589923869
0.0003599372020477337
0.0002129559793343158
0.00031461029415769305
0.0002138831103211658
0.00041805154736268416
0.00020639372616406153
0.00046046349010493877
0.00048193107131307
0.000305110738771393
0.00023990728077692805
0.00011662035744837266
0.00017657437059042218
9.956583799868253e-05
0.0001639350707825467
0.00019253082967936108
1.1118727089976617e-05
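
To make the impurity measures mentioned above concrete, here is a minimal sketch of Gini impurity and entropy computed from class proportions at a tree node.  The helper names and the example class mixes are made up for illustration; this is not scikit-learn's internal code.

```python
import math

def gini(proportions):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    return 1.0 - sum(p * p for p in proportions)

def entropy(proportions):
    """Shannon entropy in bits; classes with zero proportion contribute nothing."""
    return sum(-p * math.log2(p) for p in proportions if p > 0)

# Hypothetical class mixes at a tree node (not taken from the faces data)
for mix in ([1.0, 0.0], [0.9, 0.1], [0.5, 0.5]):
    print(mix, round(gini(mix), 3), round(entropy(mix), 3))
```

Both measures are zero for a pure node and largest for an even mix, which is why trees grown with either criterion tend to choose similar splits.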


On the iris dataset, which has only four features, we get both much better performance and an easier-to-read feature importance list.

In [6]:
from sklearn.datasets import load_iris
import numpy as np

iris = load_iris()

In [7]:
# No random_state on the split, so the test set (and the score) can change between runs
features_train, features_test, labels_train, labels_test = train_test_split(
    iris['data'], iris['target'], test_size=0.1)
irisforest = RandomForestClassifier(n_estimators=200, random_state=110)
irisforest.fit(features_train, labels_train)

irisforest.score(features_test, labels_test)

1.0

In [8]:
irisforest.feature_importances_

array([0.10038726, 0.0258324 , 0.43240505, 0.44137529])

In [9]:
iris["feature_names"]

['sepal length (cm)',
 'sepal width (cm)',
 'petal length (cm)',
 'petal width (cm)']
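
The two outputs above are easier to read side by side.  As a small sketch, the block below pairs each name with its importance and sorts from most to least important; the values are copied by hand from the outputs above so that the snippet runs on its own.

```python
# Values copied from the two outputs above
feature_names = ['sepal length (cm)', 'sepal width (cm)',
                 'petal length (cm)', 'petal width (cm)']
importances = [0.10038726, 0.0258324, 0.43240505, 0.44137529]

# Sort (name, importance) pairs from most to least important
ranked = sorted(zip(feature_names, importances),
                key=lambda pair: pair[1], reverse=True)
for name, importance in ranked:
    print(f"{name:20s} {importance:.3f}")
```

In the notebook itself, `iris["feature_names"]` and `irisforest.feature_importances_` can be passed to `zip` directly.  The two petal measurements account for roughly 87% of the total importance.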

# Boosting example

Boosting, like bagging, is an ensemble method: it trains a group of classifiers that vote on new examples.  Unlike bagging, boosting trains its learners in sequence and adaptively reweights the training set so that each new learner concentrates on the examples its predecessors got wrong; see the slides for more details.
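
As a rough illustration of that adaptive step, here is a minimal sketch of one round of the classic two-class AdaBoost weight update.  The function name and the four-example round are hypothetical, and scikit-learn's `AdaBoostClassifier` uses a multiclass variant, so its exact update differs from this one.

```python
import math

def adaboost_reweight(weights, is_wrong):
    """One round of the classic binary AdaBoost weight update.

    weights  -- current example weights, summing to 1
    is_wrong -- booleans, True where the latest learner misclassified
    Returns (alpha, new_weights), where alpha is the learner's vote weight.
    Assumes the weighted error is strictly between 0 and 1.
    """
    error = sum(w for w, wrong in zip(weights, is_wrong) if wrong)
    alpha = 0.5 * math.log((1 - error) / error)
    # Scale misclassified examples up and correctly classified ones down
    scaled = [w * math.exp(alpha if wrong else -alpha)
              for w, wrong in zip(weights, is_wrong)]
    total = sum(scaled)
    return alpha, [w / total for w in scaled]

# Hypothetical round: four equally weighted examples, the last one misclassified
alpha, new_weights = adaboost_reweight([0.25] * 4, [False, False, False, True])
print(alpha, new_weights)
```

After this update the misclassified example carries half of the total weight, so the next learner cannot beat chance on the reweighted data by repeating the same mistake.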

In [11]:
from sklearn.datasets import fetch_lfw_people
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

faces = fetch_lfw_people(min_faces_per_person=100)
face_f_train, face_f_test, face_l_train, face_l_test = train_test_split(
    faces['data'], faces['target'], test_size=0.1, random_state=110)
clf = AdaBoostClassifier(n_estimators=200)
clf.fit(face_f_train, face_l_train)  # This takes a little while
print(clf.score(face_f_train, face_l_train))  # training accuracy
print(clf.score(face_f_test, face_l_test))    # test accuracy

0.7719298245614035
0.6929824561403509
