# Ensemble Learning

# Description

The aim of this assignment is to solve a classification problem with Ensemble Learning. The goal is to understand the benefits of Ensemble Learning and to learn how to use different ensemble approaches with the Scikit-Learn library.

The following methods are implemented in this assignment:
- Voting Classifier (hard & soft voting)
- Bagging/Pasting, with Out-of-Bag evaluation
- Random Forests
- A custom ensemble classifier


In [9]:
# Common imports
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import sklearn

In [10]:
# to make this notebook's output stable across runs
np.random.seed(42)

In [11]:
# generate moon dataset
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_moons

X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Voting classifier in Scikit-Learn
- Train hard and soft voting classifiers in Scikit-Learn, each composed of at least three diverse classifiers


In [14]:
from sklearn.ensemble import VotingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression

clf1 = LogisticRegression(random_state=42)
clf2 = RandomForestClassifier(random_state=42)
clf3 = SVC(probability=True, random_state=42)

clf1.fit(X_train, y_train)
clf2.fit(X_train, y_train)
clf3.fit(X_train, y_train)

# Define the hard and soft voting classifiers
voting_clf1 = VotingClassifier(estimators=[('lr', clf1), ('rf', clf2), ('svc', clf3)], voting='hard')
voting_clf2 = VotingClassifier(estimators=[('lr', clf1), ('rf', clf2), ('svc', clf3)], voting='soft')

# Train both voting classifiers using X_train, y_train
voting_clf1.fit(X_train, y_train)
voting_clf2.fit(X_train, y_train)

# Compare the individual classifiers with the voting classifiers (hard, then soft) on X_test/y_test
from sklearn.metrics import accuracy_score

for clf in (clf1, clf2, clf3, voting_clf1, voting_clf2):
    y_pred = clf.predict(X_test)
    print(clf.__class__.__name__, accuracy_score(y_test, y_pred))

LogisticRegression 0.864
RandomForestClassifier 0.896
SVC 0.896
VotingClassifier 0.912
VotingClassifier 0.92
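
The difference between the two voting modes can be sketched without Scikit-Learn. The example below is only an illustration for a binary problem: the three probabilities are made-up values (not outputs of the models above), and the names `probas`, `hard_prediction`, and `soft_prediction` are hypothetical.

```python
from collections import Counter

# Hypothetical class-1 probabilities from three classifiers for ONE instance
# (illustrative values, not taken from the models trained above).
probas = {"lr": 0.45, "rf": 0.40, "svc": 0.90}

# Hard voting: each classifier casts one vote for its own predicted class,
# and the majority class wins.
votes = [int(p >= 0.5) for p in probas.values()]
hard_prediction = Counter(votes).most_common(1)[0][0]

# Soft voting: average the predicted probabilities across classifiers,
# then pick the class with the highest average probability.
mean_proba = sum(probas.values()) / len(probas)
soft_prediction = int(mean_proba >= 0.5)

print("hard:", hard_prediction)  # hard: 0 (two of three classifiers vote for class 0)
print("soft:", soft_prediction)  # soft: 1 (the confident classifier lifts the average above 0.5)
```

Soft voting gives more weight to confident classifiers, which is why it needs estimators that can output probabilities (hence `probability=True` on the `SVC`).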


# Bagging/Pasting
- One way to get a diverse set of classifiers is to use very different training algorithms
- Another approach is to use the same training algorithm for every predictor, but to train them on different random subsets of the training set
    - When sampling is performed with replacement, this method is called ***bagging*** 
        - short for ***bootstrap aggregating***
    - When sampling is performed without replacement, it is called pasting
- Both bagging and pasting allow training instances to be sampled several times across multiple predictors
- Only bagging allows training instances to be sampled several times for the same predictor
- Predictors can all be trained in parallel, via different CPU cores or even different servers.
- Similarly, predictions can be made in parallel.
- This is one of the reasons why bagging and pasting are such popular methods: they scale very well.
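
The two sampling schemes can be sketched with the standard library alone. This is a minimal illustration, assuming a toy training set of 10 instance indices and a subset size of 6 (both values are arbitrary choices for the example):

```python
import random

rng = random.Random(42)
training_indices = list(range(10))  # stand-in for a 10-instance training set
subset_size = 6                     # arbitrary subset size for illustration

# Bagging: sample WITH replacement, so an index may repeat within one subset.
bagging_subset = rng.choices(training_indices, k=subset_size)

# Pasting: sample WITHOUT replacement, so every index in a subset is distinct.
pasting_subset = rng.sample(training_indices, k=subset_size)

print("bagging:", sorted(bagging_subset))
print("pasting:", sorted(pasting_subset))
print("distinct in pasting subset:", len(set(pasting_subset)))
```

Each predictor in the ensemble would draw its own subset this way; in `BaggingClassifier` the `bootstrap` flag switches between the two schemes.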

In [17]:
# Scikit-Learn offers a simple API for both bagging and pasting with the BaggingClassifier class.
# The code below is an example of bagging; to use pasting instead, set bootstrap=False.
# The n_jobs parameter tells Scikit-Learn the number of CPU cores to use for training and predictions
#   (-1 tells Scikit-Learn to use all available cores).

from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

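# Bagging Classifier: an ensemble of 500 Decision Tree classifiers, each trained on a
#  bootstrap sample (drawn with replacement) of the training set.
# Note: max_samples is left at its default (1.0), so each sample has as many draws as the
#  training set; pass max_samples=100 to train each tree on only 100 instances.
# The base estimator is passed positionally because its keyword was renamed from
#  base_estimator to estimator in newer Scikit-Learn releases.
tree_clf = DecisionTreeClassifier(random_state=42)
bag_clf = BaggingClassifier(tree_clf, n_estimators=500,
                            bootstrap=True, n_jobs=-1, random_state=42)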
bag_clf.fit(X_train, y_train)

# Bagging accuracy score on the test set (X_test), using the accuracy_score metric
from sklearn.metrics import accuracy_score

y_pred = bag_clf.predict(X_test)
bag_accuracy = accuracy_score(y_test, y_pred)
print("Bagging accuracy:", bag_accuracy)

Bagging accuracy: 0.92


In [18]:
# Accuracy of a single Decision Tree (without bagging), for comparison
tree_clf = DecisionTreeClassifier(random_state=42)
tree_clf.fit(X_train, y_train)

# Tree classifier accuracy score on the test set (X_test), using the accuracy_score metric

y_pred = tree_clf.predict(X_test)
tree_accuracy = accuracy_score(y_test, y_pred)
print("Decision tree accuracy:", tree_accuracy)

Decision tree accuracy: 0.856



# Out-of-Bag evaluation
- With bagging, some instances may be sampled several times for any given predictor, while others may not be sampled at all.
- By default a BaggingClassifier samples m training instances with replacement (bootstrap=True), where m is the size of the training set
- Only about 63% of the training instances are sampled on average for each predictor (the expected fraction is 1 - (1 - 1/m)^m, which approaches 1 - 1/e ≈ 0.632 as m grows)
- The remaining 37% of the training instances that are not sampled are called out-of-bag (oob) instances; they are not the same 37% for every predictor
- Since a predictor never sees its oob instances during training, it can be evaluated on these instances, without the need for a separate validation set or cross-validation
- In Scikit-Learn, you can set oob_score=True when creating a BaggingClassifier to request an automatic oob evaluation after training
In [19]:
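# Create the bagging classifier with out-of-bag (oob) evaluation enabled and train it.
# The base estimator is passed positionally so the call works whether the keyword is
#  named base_estimator (older Scikit-Learn) or estimator (newer releases).
tree_clf = DecisionTreeClassifier(random_state=42)
oob_bag_clf = BaggingClassifier(tree_clf, n_estimators=500,
                                bootstrap=True, oob_score=True, n_jobs=-1, random_state=42)
oob_bag_clf.fit(X_train, y_train)

# Print the oob score: an accuracy estimate computed on the training instances
#  that each predictor did not see (not on the test dataset)
print("OOB score:", oob_bag_clf.oob_score_)

# Accuracy of the same classifier on the test set, for comparison with the oob estimate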
from sklearn.metrics import accuracy_score
y_pred = oob_bag_clf.predict(X_test)
oob_accuracy = accuracy_score(y_test, y_pred)
print("OOB accuracy:", oob_accuracy)

OOB score: 0.896
OOB accuracy: 0.92


# Random Forests
- A Random Forest is an ensemble of Decision Trees
- It is generally trained via the bagging method, typically with max_samples set to the size of the training set
- Instead of building a BaggingClassifier and passing it a DecisionTreeClassifier, you can use the RandomForestClassifier class, which is more convenient and optimized for Decision Trees

- Train a Random Forest classifier with 500 trees (each limited to a maximum of 16 leaf nodes), using all available CPU cores:

In [20]:
# RandomForestClassifier
from sklearn.ensemble import RandomForestClassifier

# Create the Random Forest with the sklearn RandomForestClassifier class and train it.
rnd_clf = RandomForestClassifier(n_estimators=500, max_leaf_nodes=16, n_jobs=-1)
rnd_clf.fit(X_train, y_train)

# Calculate/print the prediction accuracy of the rnd_clf for Test Data
y_pred = rnd_clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy: ", accuracy)

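# Approximate the Random Forest with a BaggingClassifier over Decision Trees (n_estimators=500, max_leaf_nodes=16).
# Note: these are plain bagged trees; they omit the random feature subset that
#  RandomForestClassifier considers at each split, so the two models are not exactly equivalent.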
bag_rnd_clf = BaggingClassifier(
    DecisionTreeClassifier(max_leaf_nodes=16), n_estimators=500, n_jobs=-1
)
bag_rnd_clf.fit(X_train, y_train)

# Calculate/print the prediction accuracy of the bag_rnd_clf for Test Data
y_pred = bag_rnd_clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy: ", accuracy)


Accuracy:  0.912
Accuracy:  0.912
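
A closer stand-in for a Random Forest also restricts each split to a random subset of the features. The sketch below is an illustration under stated assumptions, not the exact internals of `RandomForestClassifier`: it assumes `max_features="sqrt"` on the base tree is the relevant setting for classification, and it uses 100 trees instead of 500 to keep the run short. The names `forest`, `bagged_trees`, and `agreement` are hypothetical.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Same dataset and split as the rest of the notebook.
X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Random Forest with a fixed seed so the run is repeatable.
forest = RandomForestClassifier(n_estimators=100, max_leaf_nodes=16, random_state=42)
forest.fit(X_train, y_train)

# Bagged trees that also pick a random feature subset at each split
# (assumption: max_features="sqrt" mirrors the forest's behaviour).
bagged_trees = BaggingClassifier(
    DecisionTreeClassifier(max_features="sqrt", max_leaf_nodes=16),
    n_estimators=100,
    random_state=42,
)
bagged_trees.fit(X_train, y_train)

# Fraction of test instances on which both ensembles predict the same class.
agreement = (forest.predict(X_test) == bagged_trees.predict(X_test)).mean()
print(f"Fraction of identical test predictions: {agreement:.3f}")
```

The two ensembles are seeded differently internally, so the predictions are not expected to match exactly; a high agreement only suggests that the two constructions behave similarly on this dataset.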


# Write Your Novel Ensemble Model 
- Write your own customized Ensemble Classifier
- Try to get the highest score on the same dataset
- There are no limits or specific constraints, except that the classifier you write must be an Ensemble Classifier.

In [21]:
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Write your Ensemble Classifier and train it. 
log_clf = LogisticRegression()
svm_clf = SVC()
dt_clf = DecisionTreeClassifier()

my_ensemble_clf = VotingClassifier(
    estimators=[('lr', log_clf), ('svc', svm_clf), ('dt', dt_clf)], voting='hard'
)
my_ensemble_clf.fit(X_train, y_train)

# Calculate accuracy score
from sklearn.metrics import accuracy_score
y_pred = my_ensemble_clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy: ", accuracy)

Accuracy:  0.896
