# Ensemble Learning and Methods

* An ensemble method trains multiple models and combines them into a single predictor
* Predictions are made by aggregating the predictions from each model in the ensemble
    * In a classification problem, the class with the most votes wins
    * A majority-vote classifier is also called a hard-voting classifier
* Even if each model is a weak learner, the ensemble can still be a strong learner, provided that there is a sufficient number of weak learners and they are sufficiently diverse
* The above holds if:
    * all the models are independent and make uncorrelated errors, which may not be the case if they are trained on the same data
    * If they are trained on the same data, they will most likely make the same types of errors, so there will be many majority votes for the wrong class, reducing the ensemble's accuracy
* Ensemble methods work best when the models are as independent from one another as possible
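
The claim that many weak but independent learners form a strong ensemble can be checked with a small calculation. The sketch below is illustrative only: the function name `majority_vote_accuracy` and the 55% per-learner accuracy are assumptions, and it presumes perfectly independent learners, which real models trained on the same data are not.

```python
from math import comb

def majority_vote_accuracy(n_learners, p_correct):
    """Probability that a strict majority of independent learners is correct.

    Assumes every learner is right with probability p_correct, independently
    of the others (binomial model). Use an odd n_learners to avoid ties.
    """
    majority = n_learners // 2 + 1
    return sum(
        comb(n_learners, k) * p_correct**k * (1 - p_correct) ** (n_learners - k)
        for k in range(majority, n_learners + 1)
    )

single = majority_vote_accuracy(n_learners=1, p_correct=0.55)
ensemble_accuracy = majority_vote_accuracy(n_learners=501, p_correct=0.55)
print(f"single learner : {single:.3f}")
print(f"501 learners   : {ensemble_accuracy:.3f}")
```

Under the independence assumption the ensemble is far more accurate than any single learner. Correlated errors shrink that gain, which is why diversity matters.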

In [1]:
import sklearn.datasets as datasets
import sklearn.ensemble as ensemble
import sklearn.tree as tree
import sklearn.linear_model as lm
import sklearn.svm as svm
import sklearn.metrics as metrics
import sklearn.model_selection as ms
import numpy as np

In [2]:
X, y = datasets.make_classification(n_samples=15000, n_features=100, n_informative=25, n_redundant=10, n_classes=2, random_state=42)
X_train, X_test, y_train, y_test = ms.train_test_split(X, y, train_size= 0.8)

In [3]:
X.shape

(15000, 100)

In [4]:
log_clf = lm.LogisticRegression(random_state=42)
rf_clf = ensemble.RandomForestClassifier(random_state=42)
svm_clf = svm.SVC(random_state=42) # without probability=True, SVC has no predict_proba method and cannot output class probabilities

voting_classifier = ensemble.VotingClassifier(estimators=[('lc', log_clf),
                                ('rf', rf_clf),
                                ('svc', svm_clf)], voting='hard')

voting_classifier.fit(X_train, y_train)

VotingClassifier(estimators=[('lc', LogisticRegression(random_state=42)),
                             ('rf', RandomForestClassifier(random_state=42)),
                             ('svc', SVC(random_state=42))])

In [5]:
classifier_list = (log_clf, rf_clf, svm_clf, voting_classifier)

for classifier in classifier_list:
    classifier.fit(X_train, y_train)
    y_prediction_test = classifier.predict(X_test)
    y_prediction_accuracy = metrics.accuracy_score(y_test, y_prediction_test)
    print(f'{classifier.__class__.__name__} : {y_prediction_accuracy * 100}')

LogisticRegression : 78.86666666666666
RandomForestClassifier : 92.60000000000001
SVC : 96.8
VotingClassifier : 93.76666666666667


* A hard-voting classifier (majority-vote classifier) aggregates the predictions of each classifier and predicts the class that gets the most votes
* A soft-voting classifier predicts the class with the highest class probability, averaged over all the individual classifiers
    * This often achieves higher performance than hard voting because it gives more weight to highly confident votes
    * Soft voting is only possible if all the classifiers are able to estimate class probabilities, i.e. they all have a predict_proba() method
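
The difference between the two voting schemes is easiest to see on a single instance. The sketch below uses made-up probabilities for three hypothetical classifiers (A, B and C are not the models trained above), chosen so that the two schemes disagree.

```python
# Hypothetical predict_proba outputs of three classifiers for ONE instance.
# Each row is [P(class 0), P(class 1)].
probas = [
    [0.55, 0.45],  # classifier A: weakly prefers class 0
    [0.60, 0.40],  # classifier B: weakly prefers class 0
    [0.05, 0.95],  # classifier C: strongly prefers class 1
]

# Hard voting: each classifier casts one vote for its most probable class.
votes = [row.index(max(row)) for row in probas]
hard_prediction = max(set(votes), key=votes.count)

# Soft voting: average the class probabilities, then pick the largest.
n_classes = len(probas[0])
mean_proba = [sum(row[c] for row in probas) / len(probas) for c in range(n_classes)]
soft_prediction = mean_proba.index(max(mean_proba))

print(f"hard vote : class {hard_prediction}")
print(f"soft vote : class {soft_prediction}")
```

Hard voting follows the two lukewarm votes for class 0, while soft voting lets the single highly confident vote for class 1 outweigh them.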

In [6]:
log_clf = lm.LogisticRegression(random_state=42)
rf_clf = ensemble.RandomForestClassifier(random_state=42)
svm_clf = svm.SVC(random_state=42, probability=True) # this will create a predict_proba and can output class probabilities

voting_classifier = ensemble.VotingClassifier(estimators=[('lc', log_clf),
                                ('rf', rf_clf),
                                ('svc', svm_clf)], voting='soft') # change to soft-voting method 

voting_classifier.fit(X_train, y_train)

classifier_list = (log_clf, rf_clf, svm_clf, voting_classifier)
for classifier in classifier_list:
    classifier.fit(X_train, y_train)
    y_prediction_test = classifier.predict(X_test)
    y_prediction_accuracy = metrics.accuracy_score(y_test, y_prediction_test)
    print(f'{classifier} : {y_prediction_accuracy * 100}')

LogisticRegression(random_state=42) : 78.86666666666666
RandomForestClassifier(random_state=42) : 92.60000000000001
SVC(probability=True, random_state=42) : 96.8
VotingClassifier(estimators=[('lc', LogisticRegression(random_state=42)),
                             ('rf', RandomForestClassifier(random_state=42)),
                             ('svc', SVC(probability=True, random_state=42))],
                 voting='soft') : 95.36666666666666


# Bagging and Pasting

* One method to get a diverse set of models is to use the same training algorithm for every model and train each one on a different random subset of the training data
* The random subsets of data are created via random sampling of the original dataset
* If the random sampling is done with replacement (i.e. the same instance can be picked more than once), it is called Bagging (short for bootstrap aggregating)
* If the random sampling is done without replacement, it is called Pasting
* Bootstrapping introduces a bit more diversity into the subsets that each model is trained on, so bagging ends up with a slightly higher bias than pasting
* However, the extra diversity means that the models in the ensemble are less correlated with each other, so the ensemble's variance is reduced
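
The only mechanical difference between the two schemes is whether an instance can be drawn more than once. A minimal sketch on a toy set of 10 instance indices (the sizes and the seed are arbitrary choices for illustration):

```python
import random

rng = random.Random(42)
indices = list(range(10))  # indices of a toy 10-instance training set
subset_size = 6

# Bagging: sample WITH replacement, so the same index may appear repeatedly.
bagging_sample = rng.choices(indices, k=subset_size)

# Pasting: sample WITHOUT replacement, so every index is distinct.
pasting_sample = rng.sample(indices, k=subset_size)

print("bagging :", sorted(bagging_sample))
print("pasting :", sorted(pasting_sample))
```

In scikit-learn's `BaggingClassifier`, this choice corresponds to `bootstrap=True` (bagging) versus `bootstrap=False` (pasting).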

In [7]:
tree_clf = tree.DecisionTreeClassifier(random_state=42)
tree_clf.fit(X_train, y_train)

cv = ms.RepeatedKFold(n_splits=5, n_repeats=3, random_state=42)
accuracy_score_cv = ms.cross_val_score(tree_clf, X_train, y_train, scoring='accuracy', cv=cv)
print(accuracy_score_cv)
print(np.mean(accuracy_score_cv) * 100)
print(np.std(accuracy_score_cv) * 100)

#testing prediction
tree_clf_test_prediction = tree_clf.predict(X_test)
tree_clf_test_accuracy = metrics.accuracy_score(y_test, tree_clf_test_prediction)
print()
print('Testing Prediction Accuracy')
print(tree_clf_test_accuracy * 100)

[0.78583333 0.78333333 0.77041667 0.78541667 0.79291667 0.78458333
 0.77083333 0.77916667 0.79666667 0.78166667 0.80041667 0.78875
 0.785      0.77791667 0.78166667]
78.43055555555554
0.8028690836195516

Testing Prediction Accuracy
79.80000000000001


In [8]:
bag_clf = ensemble.BaggingClassifier(
                            tree.DecisionTreeClassifier(random_state=42), 
                            n_estimators=500, # number of trees/models/predictors in the ensemble
                            max_samples=100, # number of samples to be used in the bagging process to train each tree in the ensemble
                            bootstrap=True,
                            n_jobs=-1,
                            random_state=42, 
                            oob_score= True)
bag_clf.fit(X_train, y_train)

# the bagging classifier class will automatically use soft-voting method if the underlying estimator can estimate class probabilities

cv = ms.RepeatedKFold(n_splits=5, n_repeats=3, random_state=42)
accuracy_score_cv = ms.cross_val_score(bag_clf, X_train, y_train, scoring='accuracy', cv=cv)
print(accuracy_score_cv)
print(np.mean(accuracy_score_cv) * 100)
print(np.std(accuracy_score_cv) * 100)

print()
print(f'oob score : {bag_clf.oob_score_}')

bag_clf_test_prediction = bag_clf.predict(X_test)
bag_clf_test_accuracy = metrics.accuracy_score(y_test, bag_clf_test_prediction)
print()
print('Testing Prediction Accuracy')
print(bag_clf_test_accuracy*100)

[0.79458333 0.785      0.78166667 0.77458333 0.77958333 0.78708333
 0.78458333 0.78875    0.7825     0.77791667 0.78666667 0.7875
 0.78875    0.78166667 0.7825    ]
78.42222222222223
0.4810931482404946

oob score : 0.7835

Testing Prediction Accuracy
78.76666666666667


In [9]:
bag_clf.oob_decision_function_

array([[0.47368421, 0.52631579],
       [0.34747475, 0.65252525],
       [0.38709677, 0.61290323],
       ...,
       [0.26060606, 0.73939394],
       [0.72635815, 0.27364185],
       [0.57228916, 0.42771084]])

* With bagging, each model is trained on a random sample, so some training instances are never picked for a given model
* The instances that a model did not see are known as that model's out-of-bag (OOB) instances
* If 60% of the training instances are sampled on average for each model, the remaining 40% are OOB instances
* Note that they are not the same 40% for every model in the ensemble
* Since a model never sees its OOB instances during training, they can be used to evaluate it, much like a test set
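
How many instances end up out-of-bag depends on the sample size. The sketch below assumes the common default where each bootstrap sample has the same size `m` as the training set (this is not the `max_samples=100` setting used in the cell above), in which case the expected OOB fraction is (1 - 1/m)^m, which approaches 1/e for large m.

```python
import random

rng = random.Random(42)
m = 10_000  # size of a hypothetical training set

# Draw one bootstrap sample of size m and record which indices were picked.
sampled = {rng.randrange(m) for _ in range(m)}

oob_fraction = 1 - len(sampled) / m   # share of instances never drawn
theoretical = (1 - 1 / m) ** m        # probability a given instance is never drawn

print(f"simulated OOB fraction   : {oob_fraction:.3f}")
print(f"theoretical OOB fraction : {theoretical:.3f}")
```

With a much smaller `max_samples`, as in the notebook, almost every instance is out-of-bag for any single tree.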

# Random Patches and Random Subspace

* In addition to sampling the training instances, we can also sample the input features
* Random Subspace is a training method where we keep all the training instances and sample ONLY the input features
* Random Patches is a training method where we randomly sample BOTH the training instances and the input features
* Once again, sampling features is another technique for introducing more diversity into the ensemble, reducing variance at the expense of potentially slightly higher bias
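
The two schemes differ only in which axis of the data matrix is sampled. The sketch below shows the row and column indices one model would receive under each scheme; the toy sizes (8 instances, 5 features, 3 sampled features) are arbitrary assumptions.

```python
import random

rng = random.Random(42)
n_instances, n_features = 8, 5
rows = list(range(n_instances))
cols = list(range(n_features))

# Random Subspace: keep every instance, sample only the features.
subspace_rows = rows
subspace_cols = sorted(rng.sample(cols, k=3))

# Random Patches: sample the instances (here with replacement) AND the features.
patch_rows = sorted(rng.choices(rows, k=n_instances))
patch_cols = sorted(rng.sample(cols, k=3))

print("random subspace :", subspace_rows, subspace_cols)
print("random patches  :", patch_rows, patch_cols)
```

In scikit-learn's `BaggingClassifier`, instance sampling is controlled by `max_samples` and `bootstrap`, and feature sampling by `max_features` and `bootstrap_features`.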



In [10]:
new_rf_clf = ensemble.RandomForestClassifier(
                                            n_estimators=500, 
                                            bootstrap=True, # random sampling of instances
                                            max_samples=100, 
                                            max_features=25, # number of features randomly sampled as split candidates at each node
                                            oob_score=True,
                                            random_state=42, 
                                            n_jobs=-1
                                            )
new_rf_clf.fit(X_train, y_train)

cv = ms.RepeatedKFold(n_splits=5, n_repeats=3, random_state=42)
accuracy_score_cv = ms.cross_val_score(new_rf_clf, X_train, y_train, scoring='accuracy', cv=cv)
print(accuracy_score_cv)
print(np.mean(accuracy_score_cv) * 100)
print(np.std(accuracy_score_cv) * 100)

print()
print(f'oob score : {new_rf_clf.oob_score_}')

new_bag_clf_test_prediction = new_rf_clf.predict(X_test)
new_bag_clf_test_accuracy = metrics.accuracy_score(y_test, new_bag_clf_test_prediction)
print()
print('Testing Prediction Accuracy')
print(new_bag_clf_test_accuracy*100)

[0.80791667 0.79875    0.79125    0.78916667 0.79541667 0.80041667
 0.78875    0.79791667 0.79375    0.79833333 0.79583333 0.80375
 0.79791667 0.79375    0.7925    ]
79.63611111111109
0.5063181062250862

oob score : 0.7979166666666667

Testing Prediction Accuracy
79.4


# Extra-Trees

* The extremely randomized trees (Extra-Trees) model is another model that aims to reduce variance at the expense of higher bias
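
The extra randomness comes from how split thresholds are chosen: a regular decision tree searches for the best threshold on each candidate feature, while Extra-Trees draws thresholds at random. The sketch below contrasts the two on a single toy feature. The data and the helper names `gini` and `split_impurity` are illustrative assumptions, not part of the notebook or of scikit-learn's API.

```python
import random

def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)

def split_impurity(values, labels, threshold):
    """Weighted Gini impurity of the two children after splitting at threshold."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    n = len(labels)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

values = [1.0, 2.0, 3.0, 6.0, 7.0, 8.0]  # one toy feature, already sorted
labels = [0, 0, 0, 1, 1, 1]

# Regular decision tree: evaluate every midpoint and keep the best one.
candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]
best_threshold = min(candidates, key=lambda t: split_impurity(values, labels, t))

# Extra-Trees: draw the threshold uniformly at random within the feature's range.
rng = random.Random(42)
random_threshold = rng.uniform(min(values), max(values))

print("best threshold   :", best_threshold)
print("random threshold :", round(random_threshold, 3))
```

A random threshold is usually worse than the optimal one for a single tree (higher bias), but it makes the trees less alike, which is where the variance reduction comes from.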

In [11]:
e_trees = ensemble.ExtraTreesClassifier(
                                        n_estimators=500, 
                                        bootstrap=True, # random sampling of instances
                                        max_samples=100, 
                                        max_features=25, # number of features randomly sampled as split candidates at each node
                                        oob_score=True,
                                        random_state=42, 
                                        n_jobs=-1
                                        )
e_trees.fit(X_train, y_train)

cv = ms.RepeatedKFold(n_splits=5, n_repeats=3, random_state=42)
accuracy_score_cv = ms.cross_val_score(e_trees, X_train, y_train, scoring='accuracy', cv=cv)
print(accuracy_score_cv)
print(np.mean(accuracy_score_cv) * 100)
print(np.std(accuracy_score_cv) * 100)

print()
print(f'oob score : {e_trees.oob_score_}')

e_tree_clf_test_prediction = e_trees.predict(X_test)
e_tree_clf_test_accuracy = metrics.accuracy_score(y_test, e_tree_clf_test_prediction)
print()
print('Testing Prediction Accuracy')
print(e_tree_clf_test_accuracy*100)

[0.81833333 0.81208333 0.81041667 0.79208333 0.80125    0.81083333
 0.80625    0.81       0.80291667 0.80291667 0.81041667 0.80375
 0.79458333 0.81041667 0.79291667]
80.52777777777777
0.7398490002763247

oob score : 0.8014166666666667

Testing Prediction Accuracy
79.76666666666667


# AdaBoost Classifier

* A base classifier is trained and used to make predictions on the training dataset
* From the predictions, INCREASE the weights of the incorrectly classified instances
* Train a second classifier with the updated weights and use it to make predictions on the training dataset
* Repeat: update the instance weights again, train the next classifier, and so on
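
One round of the re-weighting step can be worked through by hand. The sketch below follows the textbook AdaBoost update for a binary problem: compute the weighted error rate, derive the predictor weight `alpha`, boost the misclassified instances, then normalize. The five instances and the single mistake are made up for illustration, and scikit-learn's own implementation may differ in detail.

```python
from math import exp, log

weights = [0.2, 0.2, 0.2, 0.2, 0.2]                  # uniform initial instance weights
misclassified = [False, False, True, False, False]   # hypothetical first-round mistakes
learning_rate = 1.0

# Weighted error rate of the current classifier.
error_rate = sum(w for w, wrong in zip(weights, misclassified) if wrong) / sum(weights)

# Predictor weight: the more accurate the classifier, the larger alpha.
alpha = learning_rate * log((1 - error_rate) / error_rate)

# Boost the weights of misclassified instances, then normalize to sum to 1.
boosted = [w * exp(alpha) if wrong else w for w, wrong in zip(weights, misclassified)]
total = sum(boosted)
new_weights = [w / total for w in boosted]

print("alpha       :", round(alpha, 3))
print("new weights :", [round(w, 3) for w in new_weights])
```

After normalization, the single misclassified instance carries half of the total weight, so the next classifier is pushed to get it right.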

In [12]:
adaboo_clf = ensemble.AdaBoostClassifier(
                                    base_estimator=tree.DecisionTreeClassifier(max_depth=1), # renamed to `estimator` in scikit-learn 1.2+
                                    n_estimators=200, 
                                    algorithm='SAMME.R', # deprecated in recent scikit-learn releases
                                    learning_rate=0.5, 
                                    random_state=42,        
                                        )
adaboo_clf.fit(X_train, y_train)

cv = ms.RepeatedKFold(n_splits=5, n_repeats=3, random_state=42)
accuracy_score_cv = ms.cross_val_score(adaboo_clf, X_train, y_train, scoring='accuracy', cv=cv)
print(accuracy_score_cv)
print(np.mean(accuracy_score_cv) * 100)
print(np.std(accuracy_score_cv) * 100)

adaboo_clf_test_prediction = adaboo_clf.predict(X_test)
adaboo_clf_test_accuracy = metrics.accuracy_score(y_test, adaboo_clf_test_prediction)
print()
print('Testing Prediction Accuracy')
print(adaboo_clf_test_accuracy*100)

[0.82541667 0.8325     0.8225     0.82666667 0.82083333 0.82791667
 0.81041667 0.82625    0.82541667 0.82916667 0.82958333 0.82708333
 0.8175     0.82458333 0.82375   ]
82.46388888888887
0.5192735904399485

Testing Prediction Accuracy
82.33333333333334


# Gradient Boosting Classifier

In [19]:
gb_clf = ensemble.GradientBoostingClassifier(random_state=42)

gb_clf.fit(X_train, y_train)

cv = ms.RepeatedKFold(n_splits=5, n_repeats=3, random_state=42)
accuracy_score_cv = ms.cross_val_score(gb_clf, X_train, y_train, scoring='accuracy', cv=cv)
print(accuracy_score_cv)
print(np.mean(accuracy_score_cv) * 100)
print(np.std(accuracy_score_cv) * 100)

gb_clf_test_prediction = gb_clf.predict(X_test)
gb_clf_test_accuracy = metrics.accuracy_score(y_test, gb_clf_test_prediction)
print()
print('Testing Prediction Accuracy')
print(gb_clf_test_accuracy*100)

[0.8925     0.90166667 0.89333333 0.88666667 0.8925     0.89583333
 0.89041667 0.88916667 0.88916667 0.89666667 0.89041667 0.90083333
 0.89583333 0.89083333 0.89333333]
89.32777777777777
0.41075434970608515

Testing Prediction Accuracy
89.26666666666667
