# Ensemble Learning and Random Forests

## Voting Classifiers
These classifiers consist of multiple models. Each model makes its own prediction, and the final prediction is the class that receives the most votes (<i>hard voting classifier</i>).
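
To make the voting rule concrete, here is a minimal sketch of majority voting done by hand with NumPy. The prediction matrix and the helper name `hard_vote` are made up for illustration and are not part of scikit-learn; ties in this sketch go to the lowest class label.

```python
import numpy as np

# Hypothetical class predictions of three classifiers for four instances
# (rows = classifiers, columns = instances). The values are invented.
predictions = np.array([
    [0, 1, 1, 0],
    [1, 1, 0, 0],
    [0, 1, 1, 1],
])

def hard_vote(preds):
    """Return the most frequent class label for each instance (column)."""
    n_classes = preds.max() + 1
    # Count the votes per class for every instance: shape (n_classes, n_instances).
    counts = np.apply_along_axis(np.bincount, 0, preds, minlength=n_classes)
    # The class with the most votes wins; argmax breaks ties in favour of the lowest label.
    return counts.argmax(axis=0)

majority = hard_vote(predictions)
print(majority)
```

`VotingClassifier(voting="hard")` applies the same idea to the predictions of its fitted estimators.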

In [1]:
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_moons

X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

In [3]:
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

log_clf = LogisticRegression()
rnd_clf = RandomForestClassifier()
svm_clf = SVC()

voting_clf = VotingClassifier(
    estimators=[("lr", log_clf), ("rf", rnd_clf), ("svc", svm_clf)],
    voting="hard"
)

voting_clf.fit(X_train, y_train)

In [4]:
from sklearn.metrics import accuracy_score

for clf in (log_clf, rnd_clf, svm_clf, voting_clf):
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    
    print(clf.__class__.__name__, accuracy_score(y_test, y_pred))

LogisticRegression 0.864
RandomForestClassifier 0.88
SVC 0.896
VotingClassifier 0.904


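If every classifier implements the predict_proba() method, we can use <i>soft voting</i> instead. The ensemble averages the class probabilities predicted by the individual classifiers and picks the class with the highest average probability, so confident predictions carry more weight than uncertain ones. SVC does not estimate probabilities by default, which is why it is created with probability=True below.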
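
The difference between the two voting schemes can be sketched with a few invented probabilities. The array below is a hypothetical stand-in for the predict_proba() outputs of three classifiers on two instances of a binary problem; it is not produced by the models in this notebook.

```python
import numpy as np

# Hypothetical predict_proba outputs, shape (n_classifiers, n_instances, n_classes).
probas = np.array([
    [[0.45, 0.55], [0.60, 0.40]],
    [[0.40, 0.60], [0.55, 0.45]],
    [[0.90, 0.10], [0.30, 0.70]],
])

# Hard voting (binary case): each classifier votes for its most likely class,
# and class 1 wins if more than half of the classifiers voted for it.
votes = probas.argmax(axis=2)
hard = (votes.sum(axis=0) > probas.shape[0] / 2).astype(int)

# Soft voting: average the probabilities over the classifiers first,
# then pick the class with the highest mean probability.
avg_proba = probas.mean(axis=0)
soft = avg_proba.argmax(axis=1)

print("hard:", hard)
print("soft:", soft)
```

In this toy data the third classifier is very confident, so soft voting lets it overrule two lukewarm votes, while hard voting counts every vote equally.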

In [7]:
log_clf = LogisticRegression()
rnd_clf = RandomForestClassifier()
svm_clf = SVC(probability=True)

voting_clf = VotingClassifier(
    estimators=[("lr", log_clf), ("rf", rnd_clf), ("svc", svm_clf)],
    voting="soft"
)

for clf in (log_clf, rnd_clf, svm_clf, voting_clf):
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    
    print(clf.__class__.__name__, accuracy_score(y_test, y_pred))

LogisticRegression 0.864
RandomForestClassifier 0.896
SVC 0.896
VotingClassifier 0.92


## Bagging and Pasting
A different approach is to train every predictor with the same algorithm, but on a different random subset of the training set. If the sampling is done with replacement, the method is called bagging (short for bootstrap aggregating). If the sampling is done without replacement, it is called pasting. In other words: both methods allow a training instance to be sampled several times across different predictors, but only bagging allows the same instance to be sampled several times for the same predictor.
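
The two sampling schemes can be sketched with NumPy index sampling. This is only an illustration of the idea, not how BaggingClassifier is implemented internally; the sizes assume the 375 training instances produced by the default 75/25 split of the 500 moons samples and the max_samples=100 used in this notebook.

```python
import numpy as np

rng = np.random.default_rng(42)
n_train, max_samples = 375, 100  # assumed sizes, see the note above

# Bagging: draw indices WITH replacement, so one predictor
# may receive the same training instance more than once.
bagging_idx = rng.choice(n_train, size=max_samples, replace=True)

# Pasting: draw indices WITHOUT replacement, so every instance
# appears at most once in a given predictor's subset.
pasting_idx = rng.choice(n_train, size=max_samples, replace=False)

n_unique_bagging = len(np.unique(bagging_idx))
n_unique_pasting = len(np.unique(pasting_idx))
print("unique instances (bagging):", n_unique_bagging)
print("unique instances (pasting):", n_unique_pasting)
```

In scikit-learn the choice is made with the bootstrap parameter of BaggingClassifier: bootstrap=True gives bagging and bootstrap=False gives pasting.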

In [8]:
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

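bag_clf = BaggingClassifier(
    DecisionTreeClassifier(),
    n_estimators=500,   # train 500 decision trees
    max_samples=100,    # each tree is trained on 100 sampled instances
    bootstrap=True,     # sample with replacement (bagging); False gives pasting
    n_jobs=-1           # use all available CPU cores
)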

bag_clf.fit(X_train, y_train)
y_pred = bag_clf.predict(X_test)

accuracy_score(y_test, y_pred)

0.904

### OOB score
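With bagging, each predictor sees only part of the training set. When a predictor draws as many instances as the training set contains (with replacement), about 63% of the training instances are sampled on average, and the remaining 37% or so are never seen by that predictor. These unseen instances are called out-of-bag (OOB) instances, and they are different for every predictor. Because a predictor never sees its OOB instances during training, we can use them to evaluate it without a separate validation set. We can also score the whole ensemble this way: with oob_score=True, each training instance is predicted only by the predictors that did not see it, and the resulting accuracy is stored in oob_score_.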
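
The 63% figure follows from the probability that an instance is picked at least once in m draws with replacement from m instances, 1 - (1 - 1/m)^m, which approaches 1 - 1/e for large m. The sketch below checks this with a simulation; the helper `expected_in_bag_fraction` is a hypothetical name introduced here for illustration.

```python
import numpy as np

def expected_in_bag_fraction(n_train, n_draws):
    """Expected fraction of distinct training instances hit by n_draws draws with replacement."""
    return 1 - (1 - 1 / n_train) ** n_draws

rng = np.random.default_rng(0)
m = 10_000

# Simulate one bootstrap sample of size m drawn from m instances.
sample = rng.integers(0, m, size=m)
in_bag_fraction = len(np.unique(sample)) / m
oob_fraction = 1 - in_bag_fraction

theoretical = expected_in_bag_fraction(m, m)
print(f"simulated in-bag fraction: {in_bag_fraction:.3f}")
print(f"theoretical in-bag fraction: {theoretical:.3f}")
print(f"simulated OOB fraction: {oob_fraction:.3f}")

# With max_samples smaller than the training set, each predictor sees less data.
# 375 and 100 are the training-set size and max_samples assumed for this notebook.
print(f"in-bag fraction for 100 draws from 375: {expected_in_bag_fraction(375, 100):.3f}")
```

Because the classifiers in this notebook use max_samples=100, each tree leaves out considerably more than 37% of the training set, which gives the OOB evaluation more instances per tree to work with.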

In [10]:
bag_clf = BaggingClassifier(
    DecisionTreeClassifier(),
    n_estimators=500,
    max_samples=100,
    bootstrap=True,
    n_jobs=-1,
    oob_score=True
)

bag_clf.fit(X_train, y_train)

bag_clf.oob_score_


0.928

In [11]:
y_pred = bag_clf.predict(X_test)

accuracy_score(y_test, y_pred)

0.904

In [13]:
bag_clf.oob_decision_function_

array([[0.31216931, 0.68783069],
       [0.39473684, 0.60526316],
       [1.        , 0.        ],
       [0.00518135, 0.99481865],
       [0.02313625, 0.97686375],
       [0.10498688, 0.89501312],
       [0.44297082, 0.55702918],
       [0.06266319, 0.93733681],
       [0.94320988, 0.05679012],
       [0.85340314, 0.14659686],
       [0.58823529, 0.41176471],
       [0.06940874, 0.93059126],
       [0.73684211, 0.26315789],
       [0.85714286, 0.14285714],
       [0.93830334, 0.06169666],
       [0.07397959, 0.92602041],
       [0.03157895, 0.96842105],
       [0.9144385 , 0.0855615 ],
       [0.6984127 , 0.3015873 ],
       [0.96231156, 0.03768844],
       [0.05744125, 0.94255875],
       [0.285     , 0.715     ],
       [0.87373737, 0.12626263],
       [0.9924812 , 0.0075188 ],
       [0.9673913 , 0.0326087 ],
       [0.00533333, 0.99466667],
       [0.96391753, 0.03608247],
       [1.        , 0.        ],
       [0.02331606, 0.97668394],
       [0.71688312, 0.28311688],
       [0.