# Ensemble Classifier
An ensemble classifier combines several models instead of relying on a single one, which often gives better predictive performance than any individual model. The idea is to train a group of experts (classifiers) and let them vote on the final prediction.

## Bagging (Bootstrap Aggregation)
Bagging is mainly used to reduce the variance of high-variance models such as decision trees.

<p>
    Implementation steps of Bagging –
    <ul>
        <li>Multiple subsets of equal size are drawn from the original dataset by sampling observations with replacement.</li>
        <li>A base model is created for each of these subsets.</li>
        <li>Each model is trained on its own subset, in parallel and independently of the others.</li>
        <li>The final predictions are determined by combining the predictions from all the models (for example, by majority vote for classification or by averaging for regression).</li>
    </ul>
</p>
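The steps above can be sketched by hand in a few lines. This is a minimal illustration, not a reference implementation: the synthetic data from `make_classification` and the helper names `fit_bagging` and `predict_bagging` are assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier


def fit_bagging(X, y, n_models=25, seed=0):
    """Train one decision tree per bootstrap sample of (X, y)."""
    rng = np.random.default_rng(seed)
    n = len(X)
    models = []
    for _ in range(n_models):
        # Step 1: draw n row indices with replacement (a bootstrap sample).
        idx = rng.integers(0, n, size=n)
        # Steps 2-3: fit a base model on that sample, independent of the others.
        models.append(DecisionTreeClassifier(random_state=0).fit(X[idx], y[idx]))
    return models


def predict_bagging(models, X):
    """Step 4: combine the models by majority vote (labels are non-negative ints)."""
    votes = np.stack([m.predict(X) for m in models]).astype(int)  # (n_models, n_samples)
    return np.array([np.bincount(col).argmax() for col in votes.T])


# Synthetic two-class data, used only for illustration.
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=0)
bag = fit_bagging(X_demo[:200], y_demo[:200])
bag_pred = predict_bagging(bag, X_demo[200:])
bag_acc = float(np.mean(bag_pred == y_demo[200:]))
print("bagged trees accuracy:", round(bag_acc, 2))
```

In practice, scikit-learn's `BaggingClassifier` packages the same idea; the hand-written loop is only meant to make the four steps visible.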

## Boosting Algorithm
Boosting is an ensemble technique that combines multiple weak learners into a strong learner. The weak models are trained in sequence, so that each new model tries to correct the errors of the previous ones. Training stops when the training set is predicted correctly or a preset maximum number of models is reached. One of the best-known boosting algorithms is AdaBoost (Adaptive Boosting).
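As a rough sketch of boosting in practice, the snippet below fits scikit-learn's `AdaBoostClassifier` with its default weak learner and reports test accuracy after the first learner and after the full sequence. The synthetic dataset and the variable names are assumptions made for this example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

# Synthetic two-class data, used only for illustration.
X_demo, y_demo = make_classification(n_samples=400, n_features=8, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=1
)

# Train up to 50 weak learners one after another; each round reweights the
# training samples so that the next learner focuses on earlier mistakes.
ada = AdaBoostClassifier(n_estimators=50, random_state=1)
ada.fit(X_tr, y_tr)

# staged_score yields the test accuracy after each boosting round.
staged = list(ada.staged_score(X_te, y_te))
print("after 1 learner:   ", round(staged[0], 2))
print("after all learners:", round(staged[-1], 2))
```

Comparing the first and last staged scores gives a quick impression of how much the sequential corrections contribute on a given dataset.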

Let's see an implementation of a voting ensemble. Here every model is trained on the same dataset, and the models then vote on the outcome.

In [1]:
# Importing libraries
from sklearn.ensemble import VotingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split, cross_val_score
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns

In [2]:
iris = sns.load_dataset("iris")

In [3]:
species_map = {"setosa": 1, "versicolor": 2, "virginica": 3}
get_species_name = {val: key for key, val in species_map.items()}

In [4]:
iris["species"] = iris["species"].map(species_map)

In [5]:
iris.head()

|   | sepal_length | sepal_width | petal_length | petal_width | species |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | 1 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | 1 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | 1 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | 1 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | 1 |


In [6]:
iris.isnull().sum()

sepal_length    0
sepal_width     0
petal_length    0
petal_width     0
species         0
dtype: int64

In [7]:
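# Drop setosa (label 1) and keep only the two sepal features,
# which leaves a two-class problem: versicolor (2) vs. virginica (3)
new_iris = iris[iris["species"] != 1][["sepal_length", "sepal_width", "species"]]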

In [8]:
new_iris.head()

|    | sepal_length | sepal_width | species |
|---|---|---|---|
| 50 | 7.0 | 3.2 | 2 |
| 51 | 6.4 | 3.2 | 2 |
| 52 | 6.9 | 3.1 | 2 |
| 53 | 5.5 | 2.3 | 2 |
| 54 | 6.5 | 2.8 | 2 |


In [9]:
# Features: every column except the last; target: the last column (species)
X = new_iris.iloc[:, :-1]
y = new_iris.iloc[:, -1]

In [10]:
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8, random_state=42)

In [11]:
clf1 = LogisticRegression()
clf2 = RandomForestClassifier()
clf3 = KNeighborsClassifier()

In [12]:
estimators = [("lr", clf1), ("rf", clf2), ("knn", clf3)]

In [13]:
# 10-fold cross-validated accuracy of each base model on its own
for name, clf in estimators:
    val = cross_val_score(clf, X, y, cv=10, scoring="accuracy")
    print(name, np.round(np.mean(val), 2))

lr 0.75
rf 0.6
knn 0.61


In [14]:
# voting defaults to "hard": each model casts one vote and the majority label wins
vc = VotingClassifier(estimators=estimators)
val = cross_val_score(vc, X, y, cv=10, scoring="accuracy")
print("VC-hard", np.round(np.mean(val), 2))

VC-hard 0.68


In [15]:
# Soft voting averages the predicted class probabilities of the models
vc1 = VotingClassifier(estimators=estimators, voting="soft")
val = cross_val_score(vc1, X, y, cv=10, scoring="accuracy")
print("VC-soft", np.round(np.mean(val), 2))

VC-soft 0.66
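
Hard and soft voting can disagree on the same sample, which is one reason their scores differ. The sketch below works through a single sample by hand. The probabilities are hypothetical numbers chosen for illustration, not outputs of the models above.

```python
import numpy as np

# Hypothetical class probabilities from three classifiers for one sample.
# Columns correspond to class 2 (versicolor) and class 3 (virginica).
classes = np.array([2, 3])
probas = np.array([
    [0.55, 0.45],  # model A leans slightly towards class 2
    [0.60, 0.40],  # model B leans slightly towards class 2
    [0.10, 0.90],  # model C is very confident in class 3
])

# Hard voting: each model casts one vote for its most probable class.
votes = classes[probas.argmax(axis=1)]
labels, counts = np.unique(votes, return_counts=True)
hard_pred = labels[counts.argmax()]

# Soft voting: average the probabilities, then pick the largest average.
mean_proba = probas.mean(axis=0)
soft_pred = classes[mean_proba.argmax()]

print("hard:", hard_pred, "soft:", soft_pred)  # hard: 2 soft: 3
```

Two lukewarm votes beat one confident vote under hard voting, while soft voting lets the confident model outweigh them.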
