# Introduction

Aggregating the answers of many predictors, by averaging them or taking a majority vote, often gives a better answer than relying on the single best predictor (the wisdom of the crowd). That is why an ensemble's prediction is often better than a single model's output.
A group of predictors is called an ensemble.


# Voting Classifiers

Suppose we have several predictors, each with around 80% accuracy. If their errors are not strongly correlated, their false positives and false negatives will mostly fall on different instances, so taking a majority vote among them can give a model that is correct more than 80% of the time. This is called hard voting.
<br>This works best when the predictors are as diverse as possible, because they then make different mistakes from one another.

<img src="img.PNG">
<img src="img2.PNG">
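The arithmetic behind the majority-vote claim can be sketched as follows. This is a minimal illustration, assuming a binary task, an odd number of voters, and fully independent errors; the function name `majority_vote_accuracy` is made up for this note. Real classifiers trained on the same data are correlated, so the gain in practice is smaller.

```python
from math import comb


def majority_vote_accuracy(p: float, n: int) -> float:
    """Probability that a strict majority of n independent voters, each
    correct with probability p, picks the right class (binary task, odd n)."""
    needed = n // 2 + 1  # smallest number of correct votes that wins
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(needed, n + 1)
    )


# Accuracy of the vote as more (idealised, independent) 80% voters are added.
for n in (1, 3, 11, 101):
    print(n, round(majority_vote_accuracy(0.8, n), 4))
```

With three such predictors the vote is right with probability 0.8³ + 3 · 0.8² · 0.2 = 0.896, which is already above the 80% of any single predictor.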

## Hard Voting

In [2]:
# Example
import pandas as pd
from sklearn import datasets
data = datasets.load_breast_cancer()

from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

log_clf = LogisticRegression(solver = 'lbfgs',max_iter=100000)
rnd_clf = RandomForestClassifier()
svm_clf = SVC()

voting_clf = VotingClassifier(estimators=[('lr', log_clf), ('rf', rnd_clf), ('svc', svm_clf)], voting='hard')

print(data.data.shape)

X_train = data.data[:500]
y_train = data.target[:500]

X_test = data.data[500:]
y_test = data.target[500:]

voting_clf.fit(X_train, y_train)

from sklearn.metrics import accuracy_score
for clf in (log_clf, rnd_clf, svm_clf, voting_clf):
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print(clf.__class__.__name__, accuracy_score(y_test, y_pred))

(569, 30)
LogisticRegression 0.9710144927536232
RandomForestClassifier 1.0
SVC 0.9420289855072463
VotingClassifier 0.9855072463768116


## Soft Voting

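If the predictors can estimate class probabilities (i.e. they have a `predict_proba()` method) instead of only outputting 0 or 1, we can use soft voting: the probabilities are averaged over all the predictors and the class with the highest average probability is selected. This often gives better results than hard voting because it gives more weight to highly confident votes.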
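A small hand-made case shows how the two voting rules can disagree. The three probabilities below are hypothetical values chosen for illustration, not outputs of the models in this notebook.

```python
# Hypothetical P(class 1) estimates from three classifiers for one instance.
probas_class1 = [0.45, 0.45, 0.90]

# Hard voting: each classifier casts one vote for its most likely class.
hard_votes = [int(p >= 0.5) for p in probas_class1]
hard_prediction = int(sum(hard_votes) > len(hard_votes) / 2)

# Soft voting: average the probabilities, then pick the most likely class.
mean_proba = sum(probas_class1) / len(probas_class1)
soft_prediction = int(mean_proba >= 0.5)

print("hard:", hard_prediction, "soft:", soft_prediction)
```

Two classifiers lean slightly towards class 0, so hard voting picks class 0, while the single very confident classifier pulls the averaged probability to 0.6 and soft voting picks class 1.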

In [95]:
from sklearn.naive_bayes import GaussianNB

clf1 = LogisticRegression(multi_class='multinomial', random_state=1,max_iter=100000)
clf2 = RandomForestClassifier(n_estimators=50, random_state=1)
clf3 = GaussianNB()

eclf3 = VotingClassifier(estimators=[
       ('lr', clf1), ('rf', clf2), ('gnb', clf3)],
       voting='soft', weights=[2,1,1],
       flatten_transform=True)

eclf3.fit(X_train, y_train)

from sklearn.metrics import accuracy_score
for clf in (clf1, clf2, clf3, eclf3):
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print(clf.__class__.__name__, accuracy_score(y_test, y_pred))

LogisticRegression 0.9710144927536232
RandomForestClassifier 0.9855072463768116
GaussianNB 0.9710144927536232
VotingClassifier 0.9710144927536232


In [2]:
# Example
log_clf = LogisticRegression(solver = 'lbfgs',max_iter=100000)
rnd_clf = RandomForestClassifier()
svm_clf = SVC(probability = True)

voting_clf = VotingClassifier(estimators=[('lr', log_clf), ('rf', rnd_clf), ('svc', svm_clf)], voting='soft')

voting_clf.fit(X_train, y_train)

for clf in (log_clf, rnd_clf, svm_clf, voting_clf):
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print(clf.__class__.__name__, accuracy_score(y_test, y_pred))

LogisticRegression 0.9710144927536232
RandomForestClassifier 0.9855072463768116
SVC 0.9420289855072463
VotingClassifier 0.9710144927536232


# Bagging and Pasting

Instead of using multiple algorithms on the same dataset to give better accuracy, we can use the same algorithm and train it on random subsets of dataset to produce predictors.
<br>When these subsets are sampled with replacement, the method is called bagging (bootstrap aggregating); when they are sampled without replacement, it is called pasting.

<img src="img3.PNG">

The output is aggregated as a majority vote in case of classification and an average in case of regression.
<br> Each individual predictor has a higher bias than it would have if trained on the full training set, but aggregation reduces both bias and variance.
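The difference between the two sampling schemes can be sketched with the standard library alone. The ten row indices and the subset size of 7 are stand-ins chosen for illustration.

```python
import random

rng = random.Random(42)          # fixed seed so the sketch is repeatable
row_indices = list(range(10))    # stand-in for the rows of a training set
subset_size = 7                  # plays the role of max_samples=0.7

# Bagging: sample WITH replacement, so a row can appear more than once.
bagging_subset = rng.choices(row_indices, k=subset_size)

# Pasting: sample WITHOUT replacement, so every row appears at most once.
pasting_subset = rng.sample(row_indices, k=subset_size)

print("bagging:", sorted(bagging_subset))
print("pasting:", sorted(pasting_subset))
```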

## Scikit-learn example

In [4]:
# Example of Bagging
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
bag_clf = BaggingClassifier(
    DecisionTreeClassifier(),  # model to be used
    n_estimators=500,  # number of predictors
    max_samples=0.7, # number of rows in the subset, float between 0 and 1 signifies the portion of the dataset to be sampled
    bootstrap=True,  # using bagging
    n_jobs=-1  # using all the cores
    )
bag_clf.fit(X_train, y_train)
y_pred = bag_clf.predict(X_test)
accuracy_score(y_test, y_pred)

0.9855072463768116

In [17]:
# Example of Pasting
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
bag_clf = BaggingClassifier(
    DecisionTreeClassifier(),  # model to be used
    n_estimators=500,  # number of predictors
    max_samples=0.7, # number of rows in the subset, float between 0 and 1 signifies the portion of the dataset to be sampled
    bootstrap=False,  # using pasting
    n_jobs=-1  # using all the cores
    )
bag_clf.fit(X_train, y_train)
y_pred = bag_clf.predict(X_test)
accuracy_score(y_test, y_pred)

0.9710144927536232

### Even though the accuracy remains the same between 1 tree and 500 trees, the decision boundary of the ensemble seems to generalize much better
<img src="img4.PNG">

## Out of bag Evaluation

When each DT is trained on a bootstrap subset of the data, the instances left out of that subset are called its out-of-bag (OOB) instances. The DT never sees them during training, so they can be used to estimate its accuracy. This is useful when the training dataset is not big enough to set aside a separate validation dataset, since no training data has to be given up.
<br> Each training row is evaluated only with the DTs that were not trained on it; aggregating their predictions over all rows gives the ensemble's OOB accuracy.
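How many rows end up out of bag for each predictor can be estimated with a short calculation. This sketch assumes each predictor draws as many rows as the training set contains, with replacement (the default `max_samples=1.0` with `bootstrap=True`); the helper name `expected_oob_fraction` is made up for this note.

```python
import math


def expected_oob_fraction(m: int) -> float:
    """Chance that a given row is never drawn in m draws with replacement
    from m rows, i.e. the expected share of out-of-bag rows per predictor."""
    return (1 - 1 / m) ** m


print(round(expected_oob_fraction(500), 3))  # 500 rows, the size of X_train here
print(round(math.exp(-1), 3))                # limit as m grows large
```

Roughly a third of the training rows are therefore unseen by any given predictor, which is what makes the OOB estimate possible without a separate validation set.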

In [38]:
bag_clf = BaggingClassifier(DecisionTreeClassifier(),
                            n_estimators=500,
                            bootstrap=True,
                            n_jobs=-1,
                            oob_score=True  # compute the ensemble's out-of-bag score after training
                           )
bag_clf.fit(X_train, y_train)
y_pred = bag_clf.predict(X_test)
accuracy_score(y_test, y_pred)

1.0

The class probabilities estimated for each training instance from its out-of-bag predictors are available in `oob_decision_function_`.

In [39]:
bag_clf.oob_decision_function_

array([[0.86734694, 0.13265306],
       [0.99375   , 0.00625   ],
       [1.        , 0.        ],
       [0.88636364, 0.11363636],
       [0.83146067, 0.16853933],
       [0.69886364, 0.30113636],
       [1.        , 0.        ],
       [0.96216216, 0.03783784],
       [0.93641618, 0.06358382],
       [0.94472362, 0.05527638],
       [0.78977273, 0.21022727],
       [1.        , 0.        ],
       [0.85416667, 0.14583333],
       [0.6402439 , 0.3597561 ],
       [0.92857143, 0.07142857],
       [1.        , 0.        ],
       [0.97252747, 0.02747253],
       [1.        , 0.        ],
       [1.        , 0.        ],
       [0.01587302, 0.98412698],
       [0.        , 1.        ],
       [0.        , 1.        ],
       [0.79885057, 0.20114943],
       [0.9947644 , 0.0052356 ],
       [1.        , 0.        ],
       [1.        , 0.        ],
       [0.97142857, 0.02857143],
       [1.        , 0.        ],
       [1.        , 0.        ],
       [0.78333333, 0.21666667],
       [1.

For each training instance, the OOB score uses only the subset of DTs that did not see that instance during training.

In [40]:
bag_clf.oob_score_  # estimate of the accuracy to expect on an unseen test dataset

0.96

# Random patches and Random Subspaces

**Random Patches** - Both the training instances (rows) and the features are sampled. Useful for high-dimensional inputs such as images.
<br>**Random Subspaces** - Only the features are sampled; all the rows are used.
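Both schemes can be expressed with `BaggingClassifier` through its row and feature sampling parameters. The sketch below is one possible configuration; the specific values (20 estimators, half of the features, 70% of the rows) are arbitrary choices for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Random Subspaces: keep every row, sample only the features.
subspaces_clf = BaggingClassifier(
    DecisionTreeClassifier(),
    n_estimators=20,
    bootstrap=False, max_samples=1.0,   # all rows, no resampling
    max_features=0.5,                   # each tree sees half of the features
    random_state=0,
)

# Random Patches: sample both the rows and the features.
patches_clf = BaggingClassifier(
    DecisionTreeClassifier(),
    n_estimators=20,
    bootstrap=True, max_samples=0.7,    # 70% of the rows, with replacement
    max_features=0.5,                   # each tree sees half of the features
    random_state=0,
)

for clf in (subspaces_clf, patches_clf):
    clf.fit(X, y)
    # estimators_features_ lists the feature indices each tree was given
    print(len(clf.estimators_features_[0]), "features per tree")
```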

# Random Forests

A random forest is an ensemble of decision trees, each trained on a random subset of the training data, combined using bagging to give one output.
<br> The same technique works for regression with the random forest regressor (`RandomForestRegressor`).

In [44]:
from sklearn.ensemble import RandomForestClassifier
rnd_clf = RandomForestClassifier(n_estimators=500, max_leaf_nodes=16, n_jobs=-1)
rnd_clf.fit(X_train, y_train)
y_pred_rf = rnd_clf.predict(X_test)

Unlike a single DT, an RF searches for the best feature to split a node among a random subset of the features rather than all of them. This increases the bias and reduces the variance of the model.
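To make the link with bagging concrete, a random forest can be approximated by bagging decision trees that each consider a random subset of features at every split. This is a rough sketch, not an exact equivalence: the two models will not produce identical trees, and the hyperparameters below are illustrative choices.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, y_train = X[:500], y[:500]   # same split as the earlier cells
X_test, y_test = X[500:], y[500:]

models = {
    "random forest": RandomForestClassifier(
        n_estimators=100, max_leaf_nodes=16, random_state=0
    ),
    # Bagged trees that look at sqrt(n_features) candidates per split.
    "bagged trees": BaggingClassifier(
        DecisionTreeClassifier(max_features="sqrt", max_leaf_nodes=16),
        n_estimators=100, random_state=0,
    ),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)  # accuracy on the test rows
    print(name, scores[name])
```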

## Extra-Trees

In a DT, each node is split using the best possible threshold. When we also use random thresholds for each feature instead of searching for the best one, we get Extremely Randomized Trees (Extra-Trees). This is faster to train than an RF because the time spent finding the best threshold is saved.
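Scikit-learn provides this variant as `ExtraTreesClassifier`, which takes much the same parameters as `RandomForestClassifier`. A minimal sketch on the same data split, with illustrative hyperparameters:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import ExtraTreesClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, y_train = X[:500], y[:500]   # same split as the earlier cells
X_test, y_test = X[500:], y[500:]

extra_clf = ExtraTreesClassifier(
    n_estimators=100, max_leaf_nodes=16, n_jobs=-1, random_state=0
)
extra_clf.fit(X_train, y_train)

extra_accuracy = extra_clf.score(X_test, y_test)  # accuracy on the test rows
print(extra_accuracy)
```

Whether Extra-Trees or a random forest performs better on a given problem is hard to predict in advance, so comparing both (for example with cross-validation) is usually the practical approach.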

## Feature importance

In [45]:
for name, score in zip(data["feature_names"], rnd_clf.feature_importances_):
    print(name, score)

mean radius 0.040822706311416025
mean texture 0.014522751655622114
mean perimeter 0.051239837392426685
mean area 0.041385647557718286
mean smoothness 0.005554177731460974
mean compactness 0.015323090716651987
mean concavity 0.04629303062964822
mean concave points 0.09864868398068713
mean symmetry 0.0028658193132507716
mean fractal dimension 0.004581219074849208
radius error 0.014525286295540833
texture error 0.0035896656859457417
perimeter error 0.01377513734700074
area error 0.03382974100438362
smoothness error 0.003545307096423924
compactness error 0.0035029955846609443
concavity error 0.005170196620826863
concave points error 0.004836041458759045
symmetry error 0.0033436996327992526
fractal dimension error 0.00489638439827724
worst radius 0.10404617374495274
worst texture 0.015579194675895663
worst perimeter 0.14855747319295562
worst area 0.106203294137047
worst smoothness 0.017652388374837766
worst compactness 0.011363815775368514
worst concavity 0.0306504027416024
worst concave po

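Worst radius, worst perimeter, worst area and concave points are highly important features in this forest. Scikit-learn scores a feature by how much the tree nodes that use it reduce impurity on average across all trees, so features used close to the roots tend to be important and those used only close to the leaves tend to be unimportant.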
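Sorting the scores makes the ranking easier to read than the raw printout. The sketch below uses a handful of the values printed above, copied into a plain dictionary; with the fitted model, the same sort could be applied to `zip(data["feature_names"], rnd_clf.feature_importances_)`.

```python
# A few importances copied from the output above.
importances = {
    "mean concave points": 0.09864868398068713,
    "worst radius": 0.10404617374495274,
    "worst perimeter": 0.14855747319295562,
    "worst area": 0.106203294137047,
    "mean symmetry": 0.0028658193132507716,
    "texture error": 0.0035896656859457417,
}

# Highest importance first.
ranked = sorted(importances.items(), key=lambda item: item[1], reverse=True)

for name, score in ranked:
    print(f"{name:22s} {score:.3f}")
```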

# Boosting

Boosting is an ensemble method that combines several weak learners into a strong learner by training them sequentially, each one trying to correct its predecessor's mistakes.

## Adaboost

A technique in which the weights of the training instances are updated sequentially so that each new predictor focuses on the instances its predecessor misclassified.
<img src="img5.PNG">
After training, all the predictors vote, with a weight attached to each vote based on the predictor's accuracy on the weighted training set.
<br><br><br>
**Explaining Algorithm**
<br>Let there be 100 instances on which the model needs to be trained. Initially, set the weight of each instance to 1/100.
<br>Train the first predictor and compute its weighted error rate on the training set.
<img src="img6.PNG">
<br>This error rate is evaluated for each predictor. The weight given to a predictor should be inversely related to this quantity: the lower the error rate, the higher the weight.
<img src="img7.PNG">
This is the weight attached to the predictor when aggregating the final predictions. Here, η is the learning rate.
<br>Finally, the weights of those 100 rows are updated using the equation below.
<img src="img8.PNG">
The more accurate the predictor, the greater its alpha and the more the weights of the misclassified rows are boosted. All the weights are then normalized so that they sum to 1, and the process repeats until the desired number of predictors is reached or a perfect predictor is found.
<br>The final prediction is the weighted majority vote of the predictors.
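One round of the weight update can be sketched in plain Python. This assumes the figures above show the standard AdaBoost formulas (weighted error rate, predictor weight `alpha = learning_rate * log((1 - r) / r)`, and multiplying the weights of misclassified rows by `exp(alpha)`); the function name `adaboost_round` and the toy data are made up for this note.

```python
import math


def adaboost_round(weights, misclassified, learning_rate=1.0):
    """One AdaBoost update: returns (error_rate, alpha, new_weights).

    weights       -- current instance weights
    misclassified -- True where the current predictor got the row wrong
    Requires 0 < error_rate < 1, otherwise the logarithm is undefined.
    """
    total = sum(weights)
    # Weighted error rate of the current predictor.
    error_rate = sum(w for w, wrong in zip(weights, misclassified) if wrong) / total
    # Predictor weight: larger when the error rate is smaller.
    alpha = learning_rate * math.log((1 - error_rate) / error_rate)
    # Boost the weights of the misclassified rows only.
    boosted = [
        w * math.exp(alpha) if wrong else w
        for w, wrong in zip(weights, misclassified)
    ]
    # Normalise so the weights sum to 1 again.
    norm = sum(boosted)
    return error_rate, alpha, [w / norm for w in boosted]


# Toy case: 10 rows with equal weights, the first 2 are misclassified.
weights = [1 / 10] * 10
misclassified = [True, True] + [False] * 8

error_rate, alpha, new_weights = adaboost_round(weights, misclassified)
print(error_rate, alpha)
print(new_weights)
```

In this toy case the error rate is 0.2, so alpha is log(4); after normalisation each misclassified row carries a weight of 0.25 and each correctly classified row 0.0625, meaning the two hard rows now account for half of the total weight.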

In [7]:
from sklearn.ensemble import AdaBoostClassifier
ada_clf = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=1),  # also called as decision stump, 1 node 2 leaves
    n_estimators=200,
    algorithm="SAMME.R",  # Stagewise Additive Modeling using a Multiclass Exponential loss function.
                          # R(Real) when predict_proba() can be calculated
    learning_rate=0.5
    )
ada_clf.fit(X_train, y_train)

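# Predict with the AdaBoost model trained above. The cell originally called
# bag_clf.predict here, so the score shown below came from the bagging model.
y_pred = ada_clf.predict(X_test)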
accuracy_score(y_test, y_pred)

0.9855072463768116

## Gradient Boosting

# Stacking