# Voting classifiers

Suppose you have trained 5 classifiers, each achieving about 80% accuracy. You can aggregate their predictions with a simple vote. Even if each underlying classifier is a weak learner, the ensemble can still be a strong learner (high accuracy), provided the classifiers are sufficiently diverse. This is akin to the [Wisdom of the Crowd](https://en.wikipedia.org/wiki/Wisdom_of_the_crowd). The key for this to work is that the predictors should be as independent as possible, for example by training them with different algorithms or on different training datasets. They then make different kinds of errors, which the vote can correct, increasing accuracy.
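To make the intuition concrete, the chance that a majority of classifiers is right can be computed from the binomial distribution. This is a minimal sketch for a binary problem that assumes fully independent errors, which classifiers trained on the same data rarely satisfy, so the numbers are an optimistic estimate. `majority_vote_accuracy` is a hypothetical helper name, not a library function.

```python
from math import comb

def majority_vote_accuracy(n_classifiers: int, p_correct: float) -> float:
    """Probability that a strict majority of n independent classifiers is right."""
    min_votes = n_classifiers // 2 + 1  # smallest number of votes that wins
    return sum(
        comb(n_classifiers, k) * p_correct**k * (1 - p_correct) ** (n_classifiers - k)
        for k in range(min_votes, n_classifiers + 1)
    )

print(f"{majority_vote_accuracy(5, 0.8):.3f}")   # 5 classifiers at 80% each
print(f"{majority_vote_accuracy(25, 0.8):.3f}")  # more voters, same individual accuracy
```

Under the independence assumption, 5 classifiers at 80% each give a majority that is right about 94% of the time, and adding more voters pushes this higher.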

- Hard voting: predict the class that receives the most votes.
- Soft voting: average the class probabilities predicted by the underlying models, then predict the class with the highest average probability.
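The two rules can disagree. The sketch below uses made-up probabilities from three hypothetical classifiers for a single instance to show how a confident minority can win under soft voting.

```python
import numpy as np

# Rows: classifiers; columns: P(class 0), P(class 1) for one instance (made-up numbers)
probas = np.array([
    [0.55, 0.45],
    [0.60, 0.40],
    [0.05, 0.95],
])

# Hard voting: each classifier votes for its most likely class, the majority wins
votes = probas.argmax(axis=1)
hard_pred = np.bincount(votes, minlength=probas.shape[1]).argmax()

# Soft voting: average the probabilities, then pick the most likely class
mean_proba = probas.mean(axis=0)
soft_pred = mean_proba.argmax()

print("hard:", hard_pred)
print("soft:", soft_pred, mean_proba)
```

Here hard voting picks class 0 (two votes to one), while soft voting picks class 1 because the third classifier is very confident and pulls the average to 0.6.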

## Different Algorithms

This example explores different algorithms trained on the same data.

### VotingClassifier

 - Type: Averaging Ensemble
 - Hard Voting: $\hat{y} = \arg\max_{c \in C} \sum_{i=1}^{n} \mathbb{I}(\hat{y}_i = c)$
 - Soft Voting: $\hat{y} = \arg\max_{c \in C} \sum_{i=1}^{n} w_i \cdot P_i(c)$

In [1]:
from sklearn.metrics import accuracy_score
import numpy as np

In [2]:
# Create a dataset
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split

X, y = make_moons(n_samples=500, noise=0.3, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

In [3]:
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

log_clf = LogisticRegression(solver="lbfgs", random_state=42)
rnd_clf = RandomForestClassifier(n_estimators=100, random_state=42)
svm_clf = SVC(gamma="scale", probability=True, random_state=42)

# Create a hard voting classifier
voting_clf = VotingClassifier(
    estimators=[('lr', log_clf), ('rf', rnd_clf), ('svc', svm_clf)],
    voting='hard')

# Predict and score
for clf in (log_clf, rnd_clf, svm_clf, voting_clf):
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print("Accuracy of " + clf.__class__.__name__ +": ", accuracy_score(y_test, y_pred))

Accuracy of LogisticRegression:  0.864
Accuracy of RandomForestClassifier:  0.896
Accuracy of SVC:  0.896
Accuracy of VotingClassifier:  0.912
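For comparison, a soft-voting variant of the same ensemble could look like the sketch below. It rebuilds the dataset and estimators so that it runs on its own; `soft_clf` and `soft_acc` are illustrative names, and no accuracy figure is quoted because this variant was not part of the run above.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_moons(n_samples=500, noise=0.3, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Soft voting needs predict_proba on every estimator, hence probability=True for SVC
soft_clf = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(solver="lbfgs", random_state=42)),
        ("rf", RandomForestClassifier(n_estimators=100, random_state=42)),
        ("svc", SVC(gamma="scale", probability=True, random_state=42)),
    ],
    voting="soft",
)
soft_clf.fit(X_train, y_train)

soft_acc = accuracy_score(y_test, soft_clf.predict(X_test))
print("Accuracy of soft voting classifier:", soft_acc)
```

Soft voting gives more weight to confident predictions, so it is worth comparing against the hard-voting score on your own data.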


# Different Datasets

## Bagging Classifiers
This example explores the same algorithm trained on different random subsets of the training data. Sampling with replacement (bootstrap) is called __bagging__; sampling without replacement is called __pasting__. Replacement means that a drawn instance is put back in the bag, so it can be chosen again. Bagging may therefore give a predictor duplicate training instances.

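For a training set of size $m$, the probability that a given instance is NOT selected in one draw is $1 - \frac{1}{m}$, and in $m$ draws with replacement it is $(1 - \frac{1}{m})^m$. Since $\lim\limits_{m \to \infty}(1 - \frac{1}{m})^m = \frac{1}{e} \approx 37\%$, about 37% of the training instances are never sampled for a given predictor when it draws $m$ instances. These are called out-of-bag (OOB) instances; note that they are not the same for all predictors. Setting `oob_score=True` on `BaggingClassifier()` uses them for evaluation without a separate validation set: each training instance is predicted using only the predictors that did not see it during training, and the accuracy of those aggregated predictions is reported as `oob_score_`.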
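The 37% figure is easy to check with a quick simulation. The sketch below draws a single bootstrap sample and compares the observed out-of-bag fraction with the formula; the training set size `m` is an arbitrary choice for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)
m = 10_000  # training set size (arbitrary choice for the simulation)

# One bootstrap sample: m draws with replacement from the indices 0..m-1
drawn = rng.integers(0, m, size=m)
oob_fraction = 1 - np.unique(drawn).size / m

print(f"simulated OOB fraction: {oob_fraction:.3f}")
print(f"(1 - 1/m)^m:            {(1 - 1/m) ** m:.3f}")
print(f"1/e:                    {np.exp(-1):.3f}")
```

All three values should land close to 0.368. If a predictor draws fewer than $m$ instances (for example via `max_samples`), its out-of-bag fraction is correspondingly larger.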

To sample features you can use `max_features` and `bootstrap_features`.
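As a rough illustration of feature sampling, the sketch below keeps all rows and samples only the features for each predictor (often called the random subspaces method). It assumes a synthetic 10-feature dataset from `make_classification`, since the moons data has only two features; `subspace_clf` is an illustrative name.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with 10 features (the moons dataset has only 2)
X, y = make_classification(n_samples=300, n_features=10, random_state=42)

# Keep all rows, sample half of the features for each predictor
subspace_clf = BaggingClassifier(
    DecisionTreeClassifier(random_state=42),
    n_estimators=20,
    bootstrap=False, max_samples=1.0,           # no row sampling
    bootstrap_features=True, max_features=0.5,  # sample features with replacement
    random_state=42,
).fit(X, y)

# Feature indices used by the first three predictors
for features in subspace_clf.estimators_features_[:3]:
    print(sorted(features.tolist()))
```

Because `bootstrap_features=True` draws with replacement, a feature index can appear more than once for the same predictor. Sampling both rows and features at the same time is the random patches variant.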

- Type: Bagging (Bootstrap Aggregation) 
- Objective: $\hat{f}(x) = \frac{1}{T} \sum_{t=1}^{T} h_t(x)$ (No loss function minimized; variance reduction through averaging)




In [4]:
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# BaggingClassifier uses soft voting when the base estimator provides predict_proba.
# Train 500 decision trees, each on 100 training instances sampled with
# replacement (bootstrap=True).
bag_clf = BaggingClassifier(
    DecisionTreeClassifier(max_leaf_nodes=16, max_features="sqrt"),
    n_estimators=500,
    max_samples=100,
    random_state=42,
    bootstrap=True,
    oob_score=True
).fit(X_train, y_train)
y_pred = bag_clf.predict(X_test)
print("Accuracy of bagging classifier: ", accuracy_score(y_test, y_pred))
print(f"OOB score: {bag_clf.oob_score_:0.3f}")

Accuracy of bagging classifier:  0.92
OOB score: 0.925


In [5]:
# A single decision tree, for comparison with the ensemble
tree_clf = DecisionTreeClassifier(random_state=42)
tree_clf.fit(X_train, y_train)
y_pred_tree = tree_clf.predict(X_test)
print("Accuracy of decision tree: ", accuracy_score(y_test, y_pred_tree))

Accuracy of decision tree:  0.856


## RandomForestClassifier

`RandomForestClassifier` is an out-of-the-box, optimized bagging ensemble of `DecisionTreeClassifier`. With a few exceptions, it has the hyperparameters of `DecisionTreeClassifier` (tree-growing controls) and of `BaggingClassifier` (ensemble controls). It adds extra randomness when growing trees: at each split it searches for the best feature among a random subset of features rather than among all of them. Combined with bootstrap sampling of rows, this row-and-column sampling de-correlates the trees more strongly. A plain `BaggingClassifier` can sample features at most once per predictor (`max_features`, `bootstrap_features`), not at every split.

This results in higher tree diversity, which trades a higher bias for a lower variance and generally yields a better overall model.

__Note:__ `feature_importances_` contains the normalized importance of each feature, in the column order of the training data. The values sum to 1; feature names are not included.

- Type: Bagging + Random Feature Selection
- Split Criterion (Gini, the default): $Gini(p) = 1 - \sum_{k=1}^{K} p_k^2$
- Split Criterion (Entropy): $Entropy(p) = -\sum_{k=1}^{K} p_k \log p_k$
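The two impurity measures above can be evaluated by hand for a few class distributions. This is a minimal sketch; the helper names `gini` and `entropy` are hypothetical, and the base-2 logarithm is an assumption (the formula above leaves the base unspecified).

```python
from math import log2

def gini(proportions):
    """Gini impurity of a node, given its class proportions."""
    return 1 - sum(p**2 for p in proportions)

def entropy(proportions):
    """Entropy in bits (base-2 log assumed); empty classes contribute zero."""
    return sum(p * log2(1 / p) for p in proportions if p > 0)

# A pure node, a perfectly mixed node, and a mostly pure node
for node in ([1.0, 0.0], [0.5, 0.5], [0.9, 0.1]):
    print(node, round(gini(node), 3), round(entropy(node), 3))
```

Both measures are 0 for a pure node and reach their maximum for a 50/50 split (0.5 for Gini, 1 bit for entropy), so a split that produces purer children lowers either one.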

In [6]:
from sklearn.ensemble import RandomForestClassifier

rnd_clf = RandomForestClassifier(n_estimators=500,
                                 max_leaf_nodes=16,
                                 max_samples=100,
                                 random_state=42)
rnd_clf.fit(X_train, y_train)

y_pred_rf = rnd_clf.predict(X_test)
print("Accuracy of random forest: ", accuracy_score(y_test, y_pred_rf))

Accuracy of random forest:  0.92


In [7]:
# Compare the predictions of the BaggingClassifier and RandomForestClassifier, 100% match
np.sum(y_pred == y_pred_rf) / len(y_pred)

np.float64(1.0)

In [8]:
# First 5 predictions
y_pred[:5], y_pred_rf[:5]

(array([0, 0, 0, 1, 1]), array([0, 0, 0, 1, 1]))

In [9]:
# Normalized feature importance
rnd_clf.feature_importances_, np.sum(rnd_clf.feature_importances_)

(array([0.42755541, 0.57244459]), np.float64(1.0))
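Since `feature_importances_` is a bare array, it can help to pair it with feature names yourself. The sketch below refits a forest on the same moons data so that it runs on its own; the names `x1` and `x2` are hypothetical labels, because `make_moons` returns unnamed columns.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier

X, y = make_moons(n_samples=500, noise=0.3, random_state=42)
feature_names = ["x1", "x2"]  # hypothetical names; the columns are unnamed

forest = RandomForestClassifier(n_estimators=100, random_state=42).fit(X, y)

# Pair each name with its importance and sort from most to least important
ranked = sorted(
    zip(feature_names, forest.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, importance in ranked:
    print(f"{name}: {importance:.3f}")
```

When the training data is a pandas DataFrame, the fitted estimator's `feature_names_in_` attribute can supply the names instead of a hand-written list.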