# HOMEWORK 2.2: Ensemble Learning

# Omnia Elmenshawy -  2000007 

# Description of HW2.2

The aim of this assignment is to solve a classification problem with Ensemble Learning. The goal is to understand the benefits of Ensemble Learning and to learn how to use different ensemble approaches with the Scikit-Learn library.

The following methods are implemented in this assignment:
- Voting Classifier (hard & soft voting)
- Bagging and Pasting
- Out-of-Bag evaluation
- Random Forests
- A custom Ensemble Classifier


In [1]:
# Common imports
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import sklearn

In [2]:
# to make this notebook's output stable across runs
np.random.seed(42)

In [3]:
# generate moon dataset
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_moons

X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# TODO 2.2.1 Voting classifier in Scikit-Learn
- Train voting classifiers in Scikit-Learn, composed of at least three diverse classifiers


In [4]:
from sklearn.ensemble import VotingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier

clf1 = GradientBoostingClassifier(n_estimators=100, learning_rate=1.0,  max_depth=1, random_state=0)
clf2 = RandomForestClassifier(random_state=42)
clf3 = LogisticRegression(random_state=42)

#define your hard voting classifier
voting_clf_hard = VotingClassifier(
    estimators=[('GB', clf1), ('rf', clf2), ('lr', clf3)],
    voting='hard')

#Extra: defining my soft classifier

voting_clf_soft = VotingClassifier(
    estimators=[('GB', clf1), ('rf', clf2), ('lr', clf3)],
    voting='soft')

#train your voting classifier using X_train, y_train
voting_clf_hard.fit(X_train, y_train)
voting_clf_soft.fit(X_train, y_train)


# Print the accuracy of each individual classifier and compare it with the voting classifiers (hard and soft) on the X_test/y_test data
from sklearn.metrics import accuracy_score


for model in (clf1, clf2, clf3, voting_clf_hard):
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    print("Hard voting classifiers result: ", model.__class__.__name__, accuracy_score(y_test, y_pred))
    
print(" ")

for model in (clf1, clf2, clf3, voting_clf_soft):
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    print("Soft voting classifiers result: ", model.__class__.__name__, accuracy_score(y_test, y_pred))





Hard voting classifiers result:  GradientBoostingClassifier 0.872
Hard voting classifiers result:  RandomForestClassifier 0.896
Hard voting classifiers result:  LogisticRegression 0.864
Hard voting classifiers result:  VotingClassifier 0.904
 
Soft voting classifiers result:  GradientBoostingClassifier 0.872
Soft voting classifiers result:  RandomForestClassifier 0.896
Soft voting classifiers result:  LogisticRegression 0.864
Soft voting classifiers result:  VotingClassifier 0.896


## Based on the previous result, the hard voting classifier has slightly better test accuracy (0.904) than the soft voting classifier (0.896).
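
The two voting schemes can disagree on the same sample, which is why their accuracies differ. The sketch below shows the arithmetic on one sample with three classifiers. The probabilities in `proba` are made up for illustration and are not taken from the classifiers trained above.

```python
import numpy as np

# Hypothetical predicted probabilities [P(class 0), P(class 1)] for ONE sample.
# These values are invented for illustration only.
proba = np.array([
    [0.45, 0.55],  # classifier A: weakly prefers class 1
    [0.48, 0.52],  # classifier B: weakly prefers class 1
    [0.90, 0.10],  # classifier C: strongly prefers class 0
])

# Hard voting: every classifier casts one vote for its most likely class,
# and the class with the most votes wins.
votes = proba.argmax(axis=1)
hard_prediction = int(np.bincount(votes, minlength=2).argmax())

# Soft voting: average the class probabilities over the classifiers,
# then pick the class with the highest mean probability.
mean_proba = proba.mean(axis=0)
soft_prediction = int(mean_proba.argmax())

print("hard voting predicts class", hard_prediction)
print("soft voting predicts class", soft_prediction)
```

Here two weak votes outnumber one confident vote under hard voting, while soft voting lets the confident classifier dominate the average. Which behaviour helps depends on how well calibrated the individual probabilities are.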

# TODO 2.2.2 Bagging/Pasting
- One way to get a diverse set of classifiers is to use very different training algorithms
- Another approach is to use the same training algorithm for every predictor, but to train them on different random subsets of the training set
    - When sampling is performed with replacement, this method is called ***bagging*** 
        - short for ***bootstrap aggregating***
    - When sampling is performed without replacement, it is called pasting
- Both bagging and pasting allow training instances to be sampled several times across multiple predictors
- Only bagging allows training instances to be sampled several times for the same predictor
- Predictors can all be trained in parallel, via different CPU cores or even different servers.
- Similarly, predictions can be made in parallel.
- This is one of the reasons why bagging and pasting are such popular methods: they scale very well.
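
The difference between the two sampling schemes can be seen by drawing indices directly. This is a minimal NumPy sketch of the idea, not Scikit-Learn's internal implementation. The sizes assume the 375-instance training set produced by the default 75/25 split above, and the names `bag_idx` and `paste_idx` are introduced here for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)

n_train = 375      # assumed training-set size (500 samples, default 75/25 split)
max_samples = 100  # instances drawn for each predictor

# Bagging: sample WITH replacement, so the same index can appear more than once.
bag_idx = rng.choice(n_train, size=max_samples, replace=True)

# Pasting: sample WITHOUT replacement, so every index is distinct.
paste_idx = rng.choice(n_train, size=max_samples, replace=False)

n_unique_bag = len(np.unique(bag_idx))
n_unique_paste = len(np.unique(paste_idx))

print("unique instances in a bagging sample:", n_unique_bag)
print("unique instances in a pasting sample:", n_unique_paste)
```

Each predictor in the ensemble would receive its own independent draw, which is what allows the predictors to be trained in parallel.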

In [5]:
# Scikit-Learn offers a simple API for both bagging and pasting with the BaggingClassifier class.
# The first classifier below uses bagging (bootstrap=True); to use pasting instead, set bootstrap=False.
# The n_jobs parameter tells Scikit-Learn the number of CPU cores to use for training and predictions
#   (-1 tells Scikit-Learn to use all available cores):

from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier


# Write Bagging Classifier and train it using ensemble of 500 Decision Tree classifiers, 
#  each trained on 100 training instances randomly sampled from the training set with replacement 

bag_clf = BaggingClassifier(
    DecisionTreeClassifier(random_state=42), n_estimators=500,
    max_samples=100, bootstrap=True, n_jobs=-1, random_state=42)

bag_clf.fit(X_train, y_train)
y_pred = bag_clf.predict(X_test)



# bagging accuracy score
# Using X_test dataset calculate/print Bagging Accuracy Score (using accuracy_score metric)
from sklearn.metrics import accuracy_score

print("accuracy score of bagging classifier is: ", accuracy_score(y_test, y_pred))

print(" ")

#Pasting classifier:
paste_clf = BaggingClassifier(
    DecisionTreeClassifier(random_state=42), n_estimators=500,
    max_samples=100, bootstrap=False, n_jobs=-1, random_state=42)

paste_clf.fit(X_train, y_train)
y_pred = paste_clf.predict(X_test)

print("accuracy score of pasting classifier is: ", accuracy_score(y_test, y_pred))


accuracy score of bagging classifier is:  0.904
 
accuracy score of pasting classifier is:  0.92


In [6]:
# Without bagging accuracy score
tree_clf = DecisionTreeClassifier(random_state=42)



# Calculate "Tree Classifier" accuracy score
# Using X_test dataset calculate/print Tree Classifier Accuracy Score (using accuracy_score metric)

tree_clf.fit(X_train, y_train)
y_pred_tree = tree_clf.predict(X_test)
print(accuracy_score(y_test, y_pred_tree))


0.856


## The Pasting Classifier has higher test accuracy (0.92) than the Bagging Classifier (0.904) and the single Decision Tree classifier (0.856).


# TODO 2.2.3 Out-of-Bag evaluation
- With bagging, some instances may be sampled several times for any given predictor, 
 while others may not be sampled at all. 
- By default a BaggingClassifier samples m training instances with replacement (bootstrap=True), 
 where m is the size of the training set
- Only about 63% of the training instances are sampled on average for each predictor
- The remaining 37% of the training instances that are not sampled are called out-of-bag (oob) instances
- Since a predictor never sees the oob instances during training, it can be evaluated on these instances, 
without the need for a separate validation set or cross-validation
- In Scikit-Learn, you can set oob_score=True when creating a BaggingClassifier
 to request an automatic oob evaluation after training
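
The out-of-bag share can be checked numerically. When m instances are drawn with replacement from m, a given instance is missed with probability (1 - 1/m)^m, which approaches 1/e (about 37%) as m grows. The sketch below compares that formula with a small simulation. It assumes m = 375, the training-set size produced by the default split in this notebook, and also evaluates the case of only 100 draws per predictor, as used with max_samples=100.

```python
import numpy as np

m = 375  # assumed training-set size (500 samples, default 75/25 split)

# Probability that a given instance is never drawn in m draws with replacement.
p_oob_full = (1 - 1 / m) ** m

# Same probability when only 100 instances are drawn per predictor.
p_oob_100 = (1 - 1 / m) ** 100

# Simulation: draw m indices with replacement many times and
# measure the fraction of instances that were never picked.
rng = np.random.default_rng(0)
oob_fractions = [
    1 - len(np.unique(rng.integers(0, m, size=m))) / m
    for _ in range(1000)
]
p_oob_sim = float(np.mean(oob_fractions))

print(f"theoretical oob fraction (m draws):   {p_oob_full:.3f}")
print(f"simulated oob fraction (m draws):     {p_oob_sim:.3f}")
print(f"theoretical oob fraction (100 draws): {p_oob_100:.3f}")
```

With fewer draws per predictor, a larger share of the training set is out-of-bag for each tree, so the oob estimate is based on more held-out instances per predictor.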

In [7]:
#write your out-of-bag (oob) classifier and train it. 
 
oob_bag_clf = BaggingClassifier(
    DecisionTreeClassifier(random_state=42), n_estimators=500,
    max_samples=100, bootstrap=True, n_jobs=-1, random_state=42, oob_score=True)

oob_bag_clf.fit(X_train, y_train)
y_pred_oob = oob_bag_clf.predict(X_test)


# Print the oob score. It is computed on the training instances that each predictor
# did not see during training, not on the test dataset.
print("OOB score is: ", oob_bag_clf.oob_score_)



# Calculate the test-set accuracy of the same bagging classifier for comparison
from sklearn.metrics import accuracy_score
print("accuracy score of OOB  is: ", accuracy_score(y_test, y_pred_oob))




OOB score is:  0.9253333333333333
accuracy score of OOB  is:  0.904
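
The oob score is derived from per-instance class probabilities that Scikit-Learn exposes as the `oob_decision_function_` attribute. The sketch below rebuilds the same classifier so it runs on its own and recomputes the score from that attribute. The names `oob_proba` and `manual_oob_score` are introduced here for illustration, and the recomputed value is expected to match `oob_score_` under the assumption that every training instance is out-of-bag for at least one tree.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Same data and split as earlier in the notebook.
X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

oob_bag_clf = BaggingClassifier(
    DecisionTreeClassifier(random_state=42), n_estimators=500,
    max_samples=100, bootstrap=True, n_jobs=-1, random_state=42, oob_score=True)
oob_bag_clf.fit(X_train, y_train)

# One row per TRAINING instance, one column per class: the averaged
# probabilities from the trees that did not see that instance.
oob_proba = oob_bag_clf.oob_decision_function_

# Turn the probabilities into class predictions and score them against y_train.
oob_pred = oob_bag_clf.classes_[oob_proba.argmax(axis=1)]
manual_oob_score = float(np.mean(oob_pred == y_train))

print("shape of oob_decision_function_:", oob_proba.shape)
print("oob_score_:      ", oob_bag_clf.oob_score_)
print("manual oob score:", manual_oob_score)
```

Because the oob score uses only training instances, it acts as a validation estimate and is not the same quantity as the test-set accuracy.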


# TODO 2.2.4 Random Forests
- Random Forest is an ensemble of Decision Trees
- Generally trained via the bagging method, typically with max_samples set to the size of the training set
- Instead of building a BaggingClassifier and passing it a DecisionTreeClassifier, 
you can instead use the RandomForestClassifier class, which is more convenient and optimized for Decision Trees

- Train a Random Forest classifier with 500 trees (each limited to a maximum of 16 leaf nodes), using all available CPU cores:

In [8]:
# RandomForestClassifier
from sklearn.ensemble import RandomForestClassifier


# Write your RandomForestClassifier classifier using sklearn RandomForestClassifier class and train it. 
rnd_clf =  RandomForestClassifier(n_estimators=500, max_leaf_nodes=16, n_jobs=-1, random_state=42)


# Calculate/print the prediction accuracy of the rnd_clf for Test Data


rnd_clf.fit(X_train, y_train)
y_pred_rf = rnd_clf.predict(X_test)
print("accuracy score of Random  classifier is: ", accuracy_score(y_test, y_pred_rf))

print(" ")

# Write the BaggingClassifier equivalent of the RandomForestClassifier and train it (n_estimators=500, max_leaf_nodes=16)
bag_rnd_clf = BaggingClassifier(
    DecisionTreeClassifier(max_features = "sqrt", max_leaf_nodes=16),n_estimators=500,
    n_jobs=-1, random_state=42, bootstrap=True)

# Calculate/print the prediction accuracy of the bag_rnd_clf for Test Data

bag_rnd_clf.fit(X_train, y_train)
y_pred_rf_bag = bag_rnd_clf.predict(X_test)
print("accuracy score of Bag Random  classifier is: ", accuracy_score(y_test, y_pred_rf_bag))





accuracy score of Random  classifier is:  0.912
 
accuracy score of Bag Random  classifier is:  0.912
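
Equal accuracy does not by itself show that the two models make the same predictions, so it can be useful to compare them instance by instance. The sketch below rebuilds both classifiers with the hyperparameters used above so that it runs on its own. The variable `agreement` is introduced here for illustration.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Same data and split as earlier in the notebook.
X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

rnd_clf = RandomForestClassifier(
    n_estimators=500, max_leaf_nodes=16, n_jobs=-1, random_state=42)

bag_rnd_clf = BaggingClassifier(
    DecisionTreeClassifier(max_features="sqrt", max_leaf_nodes=16),
    n_estimators=500, n_jobs=-1, random_state=42, bootstrap=True)

y_pred_rf = rnd_clf.fit(X_train, y_train).predict(X_test)
y_pred_rf_bag = bag_rnd_clf.fit(X_train, y_train).predict(X_test)

# Fraction of test instances on which both models predict the same class.
agreement = float(np.mean(y_pred_rf == y_pred_rf_bag))
print("fraction of identical predictions:", agreement)
```

A value close to 1 would suggest that the bagged trees behave almost like the Random Forest, which is what the matching accuracies hint at.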


# TODO 2.2.5 Write Your Novel Ensemble Model 
- Write your own customized Ensemble Classifier
- Try to get the highest score for the same dataset
- There is no limit and there are no specific constraints. Note that the classifier you write should be an Ensemble Classifier. 

Note: Students with the highest score on the assignment will be awarded an additional 10 points as an assignment score. All submitted scores will be ranked in descending order and the top 5 students will be awarded an additional +10 points for HW2.2.

In [9]:
# Write your Ensemble Classifier and train it. 
from sklearn.ensemble import AdaBoostClassifier
from sklearn.neighbors import KNeighborsClassifier


clf1 = GradientBoostingClassifier(max_depth =  3, n_estimators=15, random_state = 40)
clf2 = KNeighborsClassifier(n_neighbors = 5)
clf3 = DecisionTreeClassifier(random_state = 40, max_depth = 3)
clf4 = RandomForestClassifier(n_estimators=30,  max_depth =  5,  random_state = 42)
clf5 = AdaBoostClassifier()

my_ensemble_clf =  VotingClassifier(
    estimators=[ ('gb', clf1), ('knn', clf2), ('dt', clf3), ('rf', clf4), ('ada', clf5)],
    voting='hard')

my_ensemble_clf.fit(X_train, y_train)

# Calculate accuracy score
from sklearn.metrics import accuracy_score


for model in ( clf1, clf2, clf3, clf4,clf5, my_ensemble_clf):
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    print("My Ensemble classifiers result: ", model.__class__.__name__, accuracy_score(y_test, y_pred))
print(" ")





My Ensemble classifiers result:  GradientBoostingClassifier 0.92
My Ensemble classifiers result:  KNeighborsClassifier 0.912
My Ensemble classifiers result:  DecisionTreeClassifier 0.896
My Ensemble classifiers result:  RandomForestClassifier 0.912
My Ensemble classifiers result:  AdaBoostClassifier 0.904
My Ensemble classifiers result:  VotingClassifier 0.928
 


## My Ensemble classifier combines 5 classifiers with hard voting, and its test accuracy is 92.8%
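
The test set holds 125 instances, so the gap between 0.928 and the best individual score of 0.92 corresponds to a single test instance. Cross-validation on the training set gives a more stable comparison. The sketch below is one possible way to do that, not part of the original assignment. It rebuilds the same five-member ensemble so it runs on its own, and the `random_state` passed to `AdaBoostClassifier` is an assumption added here for reproducibility.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import (AdaBoostClassifier, GradientBoostingClassifier,
                              RandomForestClassifier, VotingClassifier)
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Same data and split as earlier in the notebook.
X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

my_ensemble_clf = VotingClassifier(
    estimators=[
        ('gb', GradientBoostingClassifier(max_depth=3, n_estimators=15, random_state=40)),
        ('knn', KNeighborsClassifier(n_neighbors=5)),
        ('dt', DecisionTreeClassifier(random_state=40, max_depth=3)),
        ('rf', RandomForestClassifier(n_estimators=30, max_depth=5, random_state=42)),
        ('ada', AdaBoostClassifier(random_state=42)),  # random_state is an added assumption
    ],
    voting='hard')

# 5-fold cross-validation on the training set only; the test set stays untouched.
cv_scores = cross_val_score(my_ensemble_clf, X_train, y_train, cv=5, scoring="accuracy")

print("accuracy per fold:", cv_scores)
print("mean CV accuracy: %.3f (std %.3f)" % (cv_scores.mean(), cv_scores.std()))
```

Comparing the mean and spread across folds for the ensemble and for each member would show whether the ensemble's advantage is larger than the fold-to-fold variation.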